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Project  Final  Report 


The  research  carried  out  under  this  grant  investigated  the  automated  design  of  problem-specific  repre¬ 
sentations  for  approximately  solving  Markov  decision  processes  (MDPs),  a  widely  used  model  of  sequential 
decision  making.  Much  past  work  on  solving  MDPs  in  adaptive  dynamic  programming  and  reinforcement 
learning  has  assumed  representations,  such  as  basis  functions,  are  provided  by  a  human  expert.  This 
assumption  makes  it  difficult  to  design  agents  that  can  solve  novel  collections  of  tasks.  The  research  in¬ 
vestigated  a  variety  of  approaches  to  automatic  basis  construction,  including  reward-sensitive  and  reward- 
invariant  methods,  diagonalization  and  dilation  methods,  as  well  as  orthogonal  and  overcomplete  repre¬ 
sentations.  A  unifying  perspective  on  the  various  basis  construction  methods  emerges  from  showing  they 
result  from  different  power  series  expansions  of  value  functions,  including  the  Neumann  series  expansion, 
the  Laurent  series  expansion,  and  the  Schultz  expansion. 

The  research  also  develops  new  computational  algorithms  for  learning  to  solve  Markov  decision  pro¬ 
cesses  with  an  overcomplete  representation.  The  main  idea  is  to  use  a  L\  regularized  penalty  function 
that  promotes  finding  sparse  representations  of  value  functions.  A  key  idea  investigated  in  the  research  is 
the  use  of  mirror-descent  optimization  methods  to  find  sparse  representations  of  value  functions.  A  new 
regularized  off-policy  bradient  temporal-difference  (TD)  method  was  explored,  which  combines  off-policy 
convergence,  robustness  to  irrelevant  features,  and  scalability  to  large  problems.  Experimental  results 
show  promising  performance. 
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1  Introduction 

The  automated  design  of  representations  is  a  longstanding  challenge  in  artificial  intelligence  and  machine 
learning.  The  overall  goal  of  the  research  carried  out  was  to  explore  algorithmic  methods  of  constructing 
representations  for  solving  sequential  decision  making  tasks.  Research  in  sequential  decision  making 
combines  work  in  a  number  of  areas,  including  artificial  intelligence  (Al)  [71],  dynamic  programming  (DP) 
[9],  operations  research  (OR)  [69],  as  well  as  machine  learning  (ML)  [11],  Markov  decision  processes 
(MDPs)  [24,  69]  have  emerged  as  a  core  mathematical  framework  for  modeling  sequential  decision  making. 
MDPs  model  sequential  decision  making  under  uncertainty,  where  states  are  fully  observable,  and  actions 
cause  a  stochastic  transition  among  states  in  a  fixed  constant  unit  of  time.  Extensions  of  MDPs  to  semi- 
Markov  decision  processes  (SMDPs)  [25]  have  been  used  to  model  temporally  extended  actions  and  exploit 
constraints  from  hierarchical  task  decompositions  [2],  The  basic  MDP  model  has  also  been  extended  to 
situations  where  states  are  hidden,  leading  to  the  partially  observable  MDP  (POMDP)  model  [33,  76]. 

While  traditional  methods  for  solving  MDPs  require  complete  knowledge  of  the  reward  (or  cost)  function 
and  transition  model,  there  has  been  substantial  research  over  the  past  two  decades  on  solving  MDPs 
using  sample-based  methods.  In  particular,  approximate  dynamic  programming  (ADP)  [68],  neuro-dynamic 
programming  [8],  and  reinforcement  learning  (RL)  [80]  are  closely  related  approaches  for  solving  large 
MDPs.  The  primary  strengths  of  these  frameworks  are  the  ability  to  learn  without  a  model,  crucial  in  large 
state  spaces;  the  incremental  nature  of  learning,  using  methods  like  temporal-difference  learning  (TD)  [81]; 
and  the  use  of  modern  function  approximation  architectures,  from  nonparametric  kernel  methods  [73]  to 
parametric  approaches  like  neural  networks  [8]  and  graphical  models  [35,  63].  RL,  NDP,  and  ADP  have  been 
been  highly  successful  in  solving  some  large  MDPs,  including  backgammon  [83],  cellphone  scheduling  [8], 
elevator  scheduling  [80],  helicopter  control  [58]  humanoid  robotics  [66],  multiagent  vehicle  scheduling  [68], 
and  Tetris  [8]. 

Yet,  despite  these  many  successes,  a  fundamental  limitation  of  many  of  these  approaches  is  that  they 
depend  on  a  human  designer  choosing  a  set  of  features  or  basis  functions  mapping  the  original  state  space 
into  a  lower-dimensional  (or  higher-dimensional)  discrete  or  continuous  space.  This  proposal  addresses  the 
problem  of  automatically  constructing  intermediate  problem-dependent  representations  of  an  MDP,  which 
simplifies  its  solution  as  well  as  enables  the  solution  of  related  problems.  This  proposal  investigates  a  va¬ 
riety  of  different  approaches  to  automatic  basis  generation,  including  reward-sensitive  and  reward-invariant 
methods,  diagonalization  vs.  dilation  methods,  orthogonal  vs.  overcomplete  bases,  and  “flat”  vs.  multiscale 
methods.  A  unifying  perspective  on  the  different  basis  construction  methods  emerges  by  showing  that  they 
correspond  to  different  power  series  expansions  of  the  discounted  value  function  of  a  policy.  For  example, 
Krylov  bases  [64,  67,  61 , 72]  and  proto-value  functions  [49]  can  be  related  to  terms  in  the  Neumann  series 
expansion.  Drazin  bases  [46]  correspond  to  terms  in  the  Laurent  series  expansion,  which  builds  on  the 
theory  of  generalized  spectral  inverses  in  Markov  chains  [14]  and  MDPs  [69].  Finally,  multiscale  diffusion 
wavelet  bases  [18,  48]  corresponds  to  rewriting  the  Neumman  series  using  a  product  of  dyadic  powers 
using  the  Schultz  expansion  [10]. 

The  research  investigated  a  series  of  computational  challenges  involved  in  scaling  automatic  basis  con¬ 
struction  approaches  to  large  MDPs.  In  the  model-based  setting,  compact  representations  of  bases  will  be 
constructed  using  factored  dynamic  Bayes  net  (DBN)  models  using  decision  diagrams  to  compactly  repre¬ 
sent  transition  matrices  and  reward  functions  [23].  Decision  diagram  representations  of  basis  functions  will 
be  developed  for  hierarchical  semi-Markov  decision  processes  [2]  and  concurrent  factored  SMDPs  [70].  A 
multiscale  wavelet-based  algorithm  for  computing  and  storing  basis  functions  will  be  studied.  Basis  func¬ 
tions  for  large  SMDPs  will  be  constructed  by  exploiting  task  hierarchies  and  temporally  extended  actions. 
Sparse  bases  will  be  constructed  using  L^based  regularization  [29,  36].  Incremental  algorithms  for  com¬ 
puting  spectral  bases  in  the  model-free  reinforcement  learning  setting  will  be  developed.  The  extension  of 
basis  construction  methods  to  POMDPs  will  be  studied.  A  broad  algorithmic  framework  for  learning  repre¬ 
sentation  and  control  for  solving  MDPs  will  be  investigated,  where  both  the  low-dimensional  representation 
of  an  MDP  as  well  as  the  approximately  optimal  policy  are  simultaneously  learned. 
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2  Technical  Background 


2.1  Markov  Decision  Processes 

A  Markov  decision  process  (MDP)  models  a  decision  maker  (or  “agent”)  who  interacts  with  an  external 
environment  by  taking  actions  [24,  69].  The  sequential  decision  problem  is  formulated  as  a  set  of  “states”, 
each  of  which  represents  a  particular  situation  in  the  decision  problem.  For  example,  a  state  may  be  a 
configuration  of  pieces  in  a  game  of  backgammon  or  chess,  the  location  of  a  robot,  the  number  of  jobs 
waiting  in  a  queueing  system  and  so  on.  Actions  cause  the  environment  to  stochastically  transition  to  a  new 
state.  The  desirability  of  a  state  is  determined  by  an  immediate  “reward”  or  “payoff”.  The  goal  of  the  agent 
is  to  choose  actions  such  that  it  maximizes  its  long  term  payoffs.  The  transition  from  one  state  to  the  next 
is  governed  by  a  probability  distribution  that  reflects  the  uncertainty  in  the  outcome  of  decisions. 

Formally,  a  discrete  Markov  decision  process  (MDP)  M  =  (S,A,Pas,,Raa,)  is  defined  by  a  finite  set  of 
discrete  states  S,  a  finite  set  of  actions  A,  a  transition  model  Pas,  specifying  the  distribution  over  future 
states  s'  when  an  action  a  is  performed  in  state  s,  and  a  corresponding  reward  model  Rass,  specifying  a 
scalar  cost  or  reward.  In  continuous  Markov  decision  processes,  the  set  of  states  c  Rd. 

A  value  function  is  a  mapping  S  -»  K.  or  equivalently  (in  discrete  MDPs)  a  vector  e  M|s|.  Given  a  policy 
7r :  S  A  mapping  states  to  actions,  its  corresponding  value  function  V n  specifies  the  expected  long-term 
discounted  sum  of  rewards  received  by  the  agent  in  any  given  state  s  when  actions  are  chosen  using  the 
policy.  The  value  function  can  be  written  as  a  Neumann  series  involving  the  one-step  expected  rewards  i?7r 
(a  column  vector  of  size  |S|),  and  the  transition  matrix  Pn: 

Vn  =  (I  -  7P7r)"1i?7r  =  (/  +  7 P”  +  (-fP71)2  +  . . . )R 77 

The  Neumann  expansion  yields  a  series  of  basis  vectors  called  Krylov  bases,  which  can  be  used  to  ap¬ 
proximately  solve  MDPs  [72,  67,  64,  61].  We  will  also  introduce  another  expansion  of  the  value  function  in 
terms  of  spectral  inverses  below.  Given  any  deterministic  policy  n,  the  operator  Tn  is  defined  as 

T”(V)(s)  =  Rm(a)  +7  £  P$a)V(s'). 

s'es 

Here  Rsn(s)  =  X)s/6s  Pi!*1  P7Ts<sl>  is  the  expected  immediate  reward.  It  can  be  shown  that  the  value  function 
V1T{s)  is  a  fixed  point  of  Tn,  that  is  Tn(Vn)(s)  =  yw(s).  The  Bellman  error  of  an  approximate  value  function 
V  is  defined  as  Tn(V)-V,  and  plays  a  key  role  in  both  basis  construction  as  well  as  approximation  methods 
to  solve  MDPs.  The  optimal  policy  n*  and  optimal  value  function  V71'  of  an  MDP  are  defined  as: 

V”*(s)  =  V*(s)  >  Vn(s)  Vt r,s  €  S. 

The  optimal  policy  n*  is  not  in  general  unique.  However,  any  optimal  policy  n*  defines  the  same  unique 
optimal  value  function.  The  optimal  value  function  can  be  shown  to  be  a  fixed  point  of  another  operator  T* , 
defined  as  follows. 

T*(V)(s)  =  max  ( Rsa  +  7  £  ^V{sf) 

\  s'es 

Action  value  functions  are  mappings  from  S  x  A  ->•  R,  and  represent  a  convenient  reformulation  of  the 
value  function.  The  optimal  action  value  Q*(x,a)  is  defined  as  the  long-term  value  of  the  nonstationary 
policy  performing  a  first,  and  then  acting  optimally  according  to  V *: 

Q*(s,a)  =  E  [rt+\  +  7maxQ*(st+i,  a')\st  =  s,at  =  a) 

where  V*(s)  =  maxa  Q*(s,a),  rt+i  is  the  actual  reward  received  at  the  next  time  step,  and  st+i  is  the  state 
resulting  from  executing  action  a  in  state  st.  The  corresponding  Bellman  equation  for  Q*(x,a)  is  given  as 

Q* (s,  a)  =  Rsa  +  'y'V]  Pis'  max Q*{s',a!)  (1 ) 

z — '  a' 

s' 

One  advantage  of  action  value  functions  is  that  they  can  be  used  to  directly  estimate  the  optimal  policy, 
where  7r*(s)  e  argmaxaQ*(s,a). 


2 


2.2  Approximation  Methods  for  Solving  MDPs 

For  large  discrete  or  continuous  MDPs,  it  is  not  possible  to  compute  the  value  function  exactly,  but  instead 
the  best  approximation  within  a  family  of  parametric  or  non-parametric  representations  is  sought  [8].  We 
restrict  our  focus  to  linear  architectures,  where  the  approximation  V  =  $w  lies  within  the  linear  span  of  a 
set  of  basis  functions  forming  the  columns  of  the  basis  matrix  <J>.  The  basis  function  matrix  $  is  an  |5|  x  k 
matrix,  where  each  column  represents  a  basis  function,  and  each  row  specifies  the  feature  vector  </>(s)  = 
[</>i(s), . . . ,  Ms)}t  mapping  a  particular  state  s  to  Rk.  Applying  the  Bellman  operator  Tn  to  V  =  Tu;  may 
result  in  a  vector  that  lies  outside  the  column  space  of  $,  and  so  requires  a  projection  step  to  compute  the 
next  iterate.  Approximation  algorithms  for  solving  MDPs  can  be  categorized  into  two  main  types:  Bellman 
residual  algorithms,  which  attempt  to  minimize  the  Bellman  error  min™  ||T7r($'u;)  -  $w\\^,  and  fixed  point 
methods,  which  attempt  to  minimize  the  projected  Bellman  error  min™  j|AT|T7r($w)  -  $w\\p«  [54],  Here, 
||.||p,r  refers  to  a  norm  defined  by  the  invariant  distribution  pn  of  the  Markov  chain  induced  by  the  policy 
7 r.  M|  is  a  projection  matrix  onto  the  space  spanned  by  the  basis  4>.  The  least  squares  expression  for 
Mjjj  =  where  <£>*  =  (<t>T DP^^)~1^T Dp*  is  the  pseudo-inverse  of  $  under  the  norm  defined  by  pK.  Dp * 
is  a  diagonal  matrix  whose  entries  are  given  by  the  stationary  distribution  pn .  The  solution  to  least-squares 
fixed  point  projection  is  given  by  the  following  equation. 

wpp  =  Appbpp,  where  App  =  [<S>TDP*(I  -  7 Pw)<f>]  ,  and  bFP  =  <f>TDp*Rn  (2) 

In  the  model-free  reinforcement  learning  setting,  the  A  matrix  and  b  vector  can  be  estimated  from  samples 
(st,at,rt,st+ 1)  generated  from  simulation.  In  particular,  the  update  equations  are: 

Afp  =  Afp  +  (j>{st){(j){st)  —  7^(st+i))T  1>fp  =  bFp  +  <j>(st)rt. 

The  family  of  fixed-point  algorithms  includes  least-squares  TD  (LSTD)  [12],  least-squares  policy  evaluation 
(LSPE)  [55],  as  well  as  least-squares  policy  iteration  (LSPI)  [41],  LSPI  uses  a  least-squares  approach  to 
approximate  the  action-value  function  Qn(s,  a)  for  a  policy  n.  In  this  case,  basis  functions  <j>(s,  a)  are  defined 
over  states  and  actions:  Qn(s,a;w )  =  Yj)=i  <t>j(s,a)wj,  where  the  w:i  are  weights  or  parameters  that  can 
be  determined  using  a  least-squares  method.  We  have  recently  explored  a  new  family  of  hybrid  least- 
squares  algorithms  that  interpolate  between  the  fixed  point  approach  and  the  Bellman  residual  approach 
[28].  Another  important  class  of  approximation  methods  for  solving  MDPs  include  linear  programming  (LP) 
based  approaches,  which  also  have  traditionally  used  a  fixed  set  of  bases  [19,  22], 


2.3  Generalized  Spectral  Inverses  in  Markov  Chains  and  MDPs 

The  discounted  value  function  can  also  be  expressed  as  a  Laurent  series  expansion,  which  leads  to  another 
interesting  type  of  basis.  Given  a  transition  matrix  P,  the  low-rank  singular  matrix  L  =  I  -  P  can  be 
interpreted  as  a  generalized  Lapiacian  operator  [1],  since  its  row  sums  are  0  and  its  off-diagonal  elements 
are  non-positive.  Figure  1  illustrates  a  simple  example.  A  family  of  spectral  inverses  -  the  Drazin  inverse 
and  a  special  instance  of  it  called  the  group  inverse  -  have  been  widely  studied  in  Markov  chains  [14]  and 
MDPs  [69], 


Figure  1 :  The  generalized  Lapiacian  operator  associated  with  a  Markov  chain. 

The  index  of  a  square  matrix  A  e  Cn  x  Cn  is  the  smallest  nonnegative  integer  k  such  that  7 Z{Ak)  — 
U(Ak+1).  Here,  11(A)  is  the  range  (or  column  space)  of  matrix  A.  For  example,  a  nonsingular  (square) 
matrix  A  has  index  0,  because  1Z(A°)  =  1Z(I)  =  11(A).  The  Lapiacian  L  =  I  -  P  of  a  Markov  chain  has 
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index  1.  We  can  now  define  the  Drazin  inverse  AD  of  a  square  matrix  A  as  the  unique  matrix  satisfying  the 
following  properties: 

AdAAd=Ad,  AdA  =  AAd,  Ak+1AD  =  Ak,  (3) 

where  k  is  the  index  of  A.  When  k  =  1,  the  Drazin  inverse  becomes  the  group  inverse,  which  satisfies  the 
last  property  for  the  case  k  =  1,  so  A2AD  =  A.  It  follows  that  AADA  =  A.  For  aperiodic  chains,  a  simple 
interpretation  of  the  group  inverse  of  the  Laplacian  can  be  given  based  on  the  identities: 


n—  1 

LD  =  (I-  P)D  =  (I-P  +  P *)-1  -  P*  =  lim  V  (Pk  -  P*)  . 

n— >oo  '  v  ' 

k—0 


In  other  words,  L D(i,j)  in  the  group  inverse  matrix  is  the  difference  between  the  expected  number  of  visits 
to  state  j  starting  in  state  i  following  the  transition  matrix  P  versus  the  expected  number  of  visits  to  j 
following  the  long-term  limiting  matrix  P*. 

The  Drazin  inverse  is  useful  in  linking  average-reward  and  discounted  MDPs  [69].  Define  the  Laplacian 
matrix  L 77  =  /  -  P77,  where  P77  is  a  transition  matrix.  We  reformulate  the  Bellman  equation  in  terms  of  an 
interest  rate  p  =  (1  -  7)7-1  or  equivalently,  7  =  The  interpretation  of  p  as  an  interest  rate  follows  from 
the  property  that  if  a  reward  of  1  is  “invested”  at  the  first  time  step,  then  1  +  p  is  the  amount  received  at  the 
next  time  step.  The  interest  rate  formulation  of  the  discounted  value  function  associated  with  a  policy  7 r  can 
be  written  as: 

V77  =  (I  -  'yP7T)~1R7T  =  (1  +  p)(pl  +  L77)"1#77 


For  p  >  0,  the  matrix  (pi  +  L77)-1  is  called  the  resolvent  of  -L77  at  p.  For  7  <  1,  the  spectral  radius  of 
7 P77  <  1,  and  hence  the  inverse  (I  -  7 P77)-1  exists.  Consequently,  the  resolvent  (pi  +  L77)-1  also  exists. 
Given  a  discounted  MDP  M  and  policy  7 r,  the  value  function  V 77  can  be  expressed  in  terms  of  the  gain  and 
bias  of  the  associated  Markov  reward  process  Mw  =  (P77,  P77),  and  the  Drazin  inverse  of  the  Laplacian,  as 
follows  [69]: 

V"  =  (1  +  P)  (V  V  +  h*  +  £(— pr((L^r+1ir)  ,  (4) 


n= 0 


where  p  <  a(I  -  Pn)  (the  spectral  radius  of  I  -  P 7r).  The  gain  g 77  =  (P77)*^77  represents  the  expected 
reward  (or  cost)  earned  at  each  step.  The  bias  represents  the  average-adjusted  sum  of  rewards,  namely 
h 77  =  ((Pny  -  (P77)*)^77.  As  we  show  below,  the  successive  terms  of  this  expansion  can  be  used  as 

basis  vectors,  analogous  to  the  use  of  Krylov  bases  from  the  Neumann  expansion  of  a  discounted  value 
function. 


3  Automatic  Construction  of  Basis  Functions 

First,  we  describe  a  broad  framework  for  constructing  new  features,  or  basis  functions,  for  representing 
value  functions.  The  framework  shows  how  different  approaches  can  be  viewed  as  power-series  expansions 
of  the  value  function. 

3.1  Basis  Representations  from  Power  Series  Expansions  of  the  Value  Function 

A  fundamental  question  is  how  to  automatically  generate  the  basis  matrix  $  used  in  approximation  methods 
for  solving  MDPs.  1  We  provide  a  unifying  perspective  in  this  section,  showing  several  basis  representations 
emerge  naturally  from  different  power  series  expansion  of  the  value  function.  Figure  2  illustrates  some  of 
the  dimensions  along  which  the  various  construction  methods  can  be  categorized. 

1  We  distinguish  basis  construction  approaches  from  basis  adaptation  methods  [53,  87]  which  search  over  a  pre-defined  collection 
of  parametric  bases,  such  as  RBFs  or  polynomials. 
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invariant  sensitive 


Figure  2:  Different  approaches  to  basis  function  construction  can  be  categorized  as  shown  here. 


3.1.1  Spectral  Bases 

A  widely  used  approach  to  dimensionality  reduction  in  machine  learning  and  statistics  is  to  use  eigenvectors 
of  a  suitable  matrix,  as  in  principal  components  analysis  [31  ]  and  Laplacian  eigenmaps  [4],  A  natural  analog 
of  this  approach  is  to  diagonalize  the  transition  matrix  P77,  and  use  its  eigenvectors  as  a  basis  [64],  If  the 
transition  matrix  P77  is  reversible,  there  is  a  complete  set  of  eigenvectors  T*77  =  {4>l,  ■■  -  fin)  that  provides  a 
change  of  basis  in  which  the  transition  matrix  P77  is  representable  as  a  diagonal  matrix  P77  =  $77A77(<F77)T, 
where  A77  is  a  diagonal  matrix  of  eigenvalues.  Another  way  to  express  the  above  property  is  to  write  the 
transition  matrix  as  a  sum  of  projection  matrices  associated  with  each  eigenvalue:  ^  =  £r=*W(#)r, 
where  the  eigenvectors  form  a  complete  orthogonal  basis  (i.e.  ||  (j>?  ||2=  1  and  =  0 ,i  /  j). 

It  readily  follows  that  powers  of  P77  have  the  same  eigenvectors,  but  the  eigenvalues  are  raised  to  the 
corresponding  power  (i.e.,  (P77)^77  =  (Af)fc^7r).  Since  the  basis  matrix  $  spans  all  vectors  on  the  state 
space  S,  we  can  express  the  reward  vector  P77  in  terms  of  this  basis  as  P77  =  4>77a77,  where  a77  is  a  vector  of 
scalar  weights.  For  high  powers  of  the  transition  matrix,  the  projection  matrices  corresponding  to  the  largest 
eigenvalues  will  dominate  the  expansion.  Using  the  Neumann  expansion  of  the  value  function,  we  get 

oo  oo  n  n  oo  n  1  n 

i— 0  i— 0  k—  1  k—1  i=0  k—1  ^  1 

where  we  used  the  property  that  (P77)7^  =  (AJ)Vj-  Essentially,  the  value  function  V77  is  represented  as  a 
linear  combination  of  eigenvectors  of  the  transition  matrix.  In  order  to  provide  the  most  efficient  approxima¬ 
tion,  we  can  truncate  the  summation  by  choosing  some  small  number  m  <  n  of  the  eigenvectors,  preferably 
those  for  whom  is  large. 

The  spectral  approach  of  diagonalizing  the  transition  matrix  is  problematic  since  the  transition  matrix 
P77  cannot  be  assumed  to  be  reversible,  in  which  case  one  has  to  deal  with  complex  eigenvalues  (and 
eigenvectors).  Proto-value  functions  (PVFs)  [49,  26,  59]  are  an  orthogonal  reward-invariant  basis  using  the 
eigenvectors  of  the  symmetric  graph  Laplacian,  which  can  be  viewed  as  an  “approximate”  transition  model 
that  captures  the  topological  properties  of  the  underlying  state  space  manifold.  The  graph  is  constructed 
by  connecting  adjacent  states  in  an  MDP  reachable  from  each  other  under  any  policy.  In  a  continuous 
MDP,  the  states  are  sampled  by  simulation,  and  a  local  distance  metric  such  as  Euclidean  distance  is  used. 
The  initial  version  was  restricted  to  symmetric  graphs  on  discrete  MDPs  [51],  but  later  variants  included 
continuous  MDPs  [50],  directed  graphs  [26],  and  hierarchical  SMDPs  [59,  60].  We  have  recently  extended 
this  approach  to  partially  observable  MDPs  (POMDPs)  as  well,  where  the  graph  is  over  high-dimensional 
belief  states  [84], 

3.1.2  Krylov  Bases 

Another  basis  called  the  Krylov  basis  results  from  the  Neumann  expansion  of  the  value  function  [64,  67,  72]. 
Krylov  bases  are  generated  by  dilating  the  reward  function  by  powers  of  the  transition  matrix.  Given  the 


Vn  =  E(7  P*y$na 
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50  state  Chain  MDP 


Approximate  Value  Function 


Optimal  Value  Function  V* 


Policy  (1=N,  2=E,3=S,  4=W) 


Figure  3:  Left:  A  simple  “two-room”  robot  navigation  problem  modeled  as  an  MDP  has  100  states.  The 
agent  is  rewarded  by  100  for  reaching  a  corner  goal  state.  Compass  navigation  actions  are  stochastic, 
succeeding  with  probability  0.9.  Using  just  4  Drazin  bases,  the  optimal  policy  can  be  found.  Right:  Ex¬ 
perimental  comparison  of  several  ways  of  automatically  constructing  bases  on  a  50  state  chain  MDP  with 
rewards  in  state  10  and  41:  DBF  represents  Drazin  bases;  BEBFs  represent  Bellman  error  bases;  PVFs 
represent  proto-value  functions;  PVF-MP  represent  a  matching  pursuit  variant  of  PVFs.  See  Section  3.2  for 
an  explanation  of  the  plot. 


transition  matrix  P *  and  reward  function  R* ,  the  jth  Krylov  subspace  /C?  is  defined  as  the  space  spanned 
by  the  vectors: 

K*j  =  {K*,  P "iT ,  (P")2r, . . . ,  (P"y-1Kr}  (5) 

Note  that  KP i  c  KP 2  c  . . .,  such  that  for  some  to,  KPm  =  /C7rm+ 1  =  KP .  Thus,  1C*  is  the  P^-invariant  Krylov 
space  generated  by  P*  and  IT .  An  incremental  variant  of  the  Krylov-based  approach  is  called  Bellman 
error  basis  functions  (BEBFs),  which  is  useful  in  the  model-free  reinforcement  learning  setting  [34,  61].  As 
shown  in  Figure  3,  Krylov  (and  BEBF  bases  can  initially  generate  large  approximation  errors  if  the  first  few 
bases  are  highly  localized  around  the  region  of  a  sparse  reward. 

3.1.3  Drazin  Bases 

The  Laurent  series  expansion  of  the  value  function  similarly  motivates  another  basis  construction  method, 
which  combines  the  Laplacian  and  Krylov-based  approaches  [46].  The  Drazin  space  associated  with  a 
policy  7 r  and  reward  function  IP  is  defined  as  the  space  spanned  by  the  set  of  Drazin  vectors: 

=  {P*R*1  {17)DR*,  {$?)D)2EP,  ((L’r)D)m_1P7r}  (6) 

Intuitively,  in  dilating  the  reward  function,  we  are  not  using  the  transition  matrix  P* ,  but  rather  the  modified 
transition  matrix  P*  -  P*.  The  first  basis  vector  is  the  average-reward  or  gain  g*  =  P*R *  of  policy  n.  The 
second  basis  vector  is  the  product  of  (h*)D  and  RP .  Subsequent  basis  vectors  are  defined  by  the  dilation  of 
the  reward  function  Rn  by  higher  powers  of  the  Drazin  inverse  (h*)0  used  in  the  Laurent  series  expansion 
of  Vn .  Figure  3  shows  that  Drazin  bases  outperforms  the  other  bases  on  a  two-room  MDP.  However,  a 
drawback  of  Drazin  bases  is  that  they  are  computationally  expensive  to  generate.  A  key  challenge  to  be 
addressed  in  this  research  is  developing  incremental  algorithms  that  approximate  Drazin  bases. 
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3.1.4  Multiscale  Bases  using  the  Schultz  Expansion 

Finally,  we  describe  a  multiscale  policy-sensitive  and  overcomplete  basis  that  emerges  from  another  power 
series  representation  of  the  value  function  using  the  Schultz  expansion  [10].  This  trick  lets  us  rewrite  the 
Neumann  series  as  the  product  of  dyadic  powers  (powers  of  two)  of  the  discounted  transition  matrix  7 Pn: 

OO 

V"  =  (/  -  'yP7T)~1Rn  =  JJ  (/  +  (7  Pnfk  )Rn  (7) 

fc= 0 

In  practice,  the  Schultz  expansion  converges  far  more  quickly  than  the  Neumann  series.  It  is  closely  related 
to  Successive  Matrix  Squaring  (SMS)  methods  described  below  [15,  86].  However,  computing  the  dyadic 
powers  of  the  transition  matrix  seems  a  daunting  prospect.  In  past  work,  we  have  shown  how  the  multiscale 
diffusion  wavelet  algorithm  [1 8]  provides  an  efficient  representation  for  dyadic  powers  of  stochastic  matrices 
[44]  (see  Figure  4). 

The  diffusion  wavelet  framework  constructs  a  multi-resolution  decomposition  of  the  functions  on  a  state 
space  as  a  family  of  nested  subspaces  VJjDF,  D  •  •  •  D  V)  D  —  If  the  transition  matrix  P  is  interpreted  as 
an  operator  on  functions  on  the  state  space,  then  V,  is  defined  as  the  numerical  range,  up  to  precision  e,  of 
p23+1-i .  p  js  assumed  to  be  a  sparse  matrix,  and  that  the  numerical  rank  of  the  powers  of  P  is  assumed  to 
decay  rapidly  for  higher  powers.  A  diffusion  wavelet  tree  consists  of  orthogonal  diffusion  scaling  functions 
that  are  smooth  bump  functions,  with  some  oscillations,  at  scale  roughly  2J,  and  orthogonal  wavelets 
Tj  that  are  smooth  localized  oscillatory  functions  at  the  same  scale.  The  scaling  functions  4>;/  span  the 
subspace  V),  with  the  property  that  Vj+ 1  c  V:j,  and  the  span  of  ,  Wv  is  the  orthogonal  complement  of  V, 
into  Vj+ 1. 


Figure  4:  Compressed  representation  of  dyadic  powers  of  the  transition  matrix  Pn  in  the  two-room  MDP 
(shown  in  Figure  3),  from  the  first  (extreme  left)  to  the  sixth  (extreme  right).  Higher  dyadic  powers  are 
compressed  into  progressively  smaller  sized  arrays  by  representing  the  powers  using  a  learned  hierarchy 
of  scaling  function  bases.  For  example,  the  sixth  dyadic  power  (26  =  64)  is  effectively  of  size  1,  a  significant 
compression  from  its  original  100  x  100  size.  Entries  are  displayed  in  log10  scale. 

A  principal  attraction  of  diffusion  wavelet  representations  is  their  multiscale  nature:  they  provide  efficient 
ways  of  representing  powers  of  stochastic  matrices.  One  drawback  of  diffusion  wavelets  is  that  it  can  gen¬ 
erate  a  large  number  of  overcomplete  bases,  which  needs  to  be  effectively  pruned  using  a  basis  selection 
algorithm.  We  discuss  basis  selection  methods  below.  While  the  complexity  of  constructing  a  diffusion 
wavelet  tree  is  quadratic  in  the  size  of  the  matrix,  it  can  nevertheless  be  expensive  to  run  the  procedure 
on  large  stochastic  matrices.  We  propose  to  address  these  challenges  by  developing  a  modified  reward- 
sensitive  diffusion  wavelet  procedure,  which  generates  a  sparse  set  of  compact  factored  bases  for  a  specific 
reward  function,  as  well  as  to  develop  basis  selection  methods  that  exploit  L±  regularization. 


3.2  Experimental  Comparisons 

An  important  objective  of  the  research  is  to  compare  different  methods  for  automatically  constructing  basis 
functions.  Such  comparisons  can  be  done  in  several  different  settings:  (i)  Exact  model-based  experiments, 
which  assume  that  the  reward  function  and  transition  matrix  of  a  fixed  policy  n  are  known  (ii)  Approximate 
model-free  setting,  where  only  sample  trajectories  and  rewards  are  available.  Figure  3  compares  the  per¬ 
formance  of  Drazin  basis  functions  (DBFs),  Bellman-error  basis  functions  (BEBFs),  and  two  variants  of 
proto-value  functions  (PVFs  and  PVF-MP)  on  a  50  state  chain  MDP  [41],  The  two  actions  (go  left,  or  go 
right)  succeed  with  probability  0.9.  When  the  actions  fail,  they  result  in  movement  in  the  opposite  direction 
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with  probability  0.1.  The  two  ends  of  the  chain  are  treated  as  “dead  ends”.  Rewards  of  +1  are  given  in 
states  10  and  41.  The  PVF-MP  algorithm  selects  basis  functions  incrementally  based  upon  the  Bellman 
error,  where  basis  function  k+1  is  the  PVF  that  has  largest  inner  product  with  the  Bellman  error  resulting 
from  the  previous  k  basis  functions.  The  plots  in  the  figure  represent  the  following  error  terms,  following  the 
style  of  evaluation  proposed  in  [62], 

1.  Reward  error:  The  reward  error  measures  the  difference  between  the  actual  reward  Rn  and  the 

approximated  reward  i?J  =  projected  on  the  column  space  of  bases  $,  which  is 

plotted  in  the  figure  using  the  L2  norm  of  the  vector. 

2.  Weighted  feature  error:  The  weighted  feature  error  is  defined  as  the  L2  norm  of  the  vector  7A$w$, 

where  P$  =  is  the  projected  transition  matrix,  and 

=  P$  - 

=  (/  —  7P5)-1  r$ 

3.  Bellman  error:  The  Bellman  error  is  the  linear  sum  of  the  reward  error  and  the  feature  error.  2 

PVFs  can  result  in  a  large  reward  error  if,  as  in  this  case,  the  reward  function  is  poorly  approximated  by 
the  eigenvectors  of  the  combinatorial  Laplacian  on  the  chain  graph.  However,  PVFs  have  very  low  weighted 
feature  error.  The  overall  Bellman  error  is  largely  due  to  the  reward  error.  The  reward  error  for  BEBFs  is  by 
definition  0  as  Rn  is  a  basis  vector  itself.  However,  the  feature  error  is  extremely  high,  particularly  initially, 
until  around  15  bases  are  used.  Consequently,  the  Bellman  error  for  BEBFs  remains  high  initially.  Measured 
overall,  Drazin  bases  have  the  best  performance  overall  at  this  task.  All  three  types  of  error  are  moderate 
initially  and  reduce  quickly  to  0  using  around  5-6  bases,  three  times  as  quickly  as  BEBFs  and  substantially 
faster  than  PVFs  and  PVF-MPs. 

In  addition  to  comparisons  on  simplified  MDPs,  the  goal  of  the  research  is  to  also  perform  comparisons 
of  basis  construction  methods  on  larger  discrete  and  continuous  MDPs,  where  exact  models  are  unavailable 
or  difficult  to  learn.  A  rigorous  comparison  of  automatic  basis  generation  methods  would  provide  a  valuable 
addition  to  the  literature,  and  help  practitioners  who  want  to  apply  these  methods  to  large  MDPs. 

3.3  Combining  Different  Bases 

Since  different  basis  construction  methods  often  have  complementary  strengths  and  weaknesses,  combin¬ 
ing  them  may  be  an  effective  approach.  We  discuss  two  main  ideas  for  integration  of  different  bases. 

3.3.1  Integrating  Reward-Sensitive  and  Reward-Invariant  Bases 

One  of  the  key  goals  of  the  research  is  to  explore  ways  of  combining  reward-sensitive  and  reward-invariant 
bases.  There  are  several  reasons:  reward-invariant  bases  are  more  transferrable  and  can  be  applied  with 
many  different  reward  functions  on  a  state  space;  however,  they  may  be  less  effective  on  specific  reward 
functions  and  policies  [62].  In  semi-Markov  decision  processes,  where  the  overall  problem  is  hierarchically 
decomposed  into  a  set  of  simpler  SMDPs,  lower  level  subproblems  may  be  repeatedly  called  with  slightly 
varying  parameters  (e.g.,  a  common  “navigation”  routine  that  tasks  a  mobile  robot  to  different  rooms  in  a 
building).  In  this  case,  sharing  basis  functions  among  related  subtasks  seems  useful.  On  the  other  hand, 
at  higher  levels,  it  may  be  more  useful  to  use  reward-sensitive  bases. 

Graph-theoretic  bases,  such  as  proto-value  functions  and  geodesic  Gaussian  kernels  [79],  can  be  insen¬ 
sitive  to  the  reward  function  if  the  graph  is  constructed  without  taking  rewards  into  account.  One  approach 
to  constructing  reward-sensitive  graph-theoretic  bases  is  to  incorporate  rewards  as  part  of  the  weight  ma¬ 
trix  in  the  graph-based  Laplacian  operator.  The  directed  Laplacian  operator  [17]  provides  a  convenient 
way  of  incorporating  rewards  into  the  graph  representation,  since  it  allows  asymmetric  edges.  In  a  prelimi¬ 
nary  study,  we  found  directed  bases  outperformed  the  non-directional  bases  in  continuous  MDPs  [26]  and 

2The  reward  and  weighted  feature  error  vectors  can  individually  be  high,  and  yet  cancel  each  other  out.  While  the  Bellman  error  is 
a  natural  metric  to  compare  bases,  other  measures  related  to  the  quality  of  the  learned  policy  also  need  to  be  used. 
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SMDPs  [59].  Another  approach  is  to  construct  “hybrid”  bases.  For  example,  Petrik  [64]  proposed  combining 
“low-frequency”  global  Laplacian  eigenvectors  with  “high-frequency”  localized  Krylov  vectors,  which  helps 
alleviate  the  problem  of  large  initial  approximation  errors. 


4  Regularized  Off-Policy  Convergent  Reinforcement  Learning 

In  this  section,  we  describe  our  recent  work  on  regularized  sparse  methods  for  learning  value  functions. 
Our  main  ideas  are  to  combine  sophisticated  methods  of  optimization,  such  as  mirror  descent,  with  work  in 
reinforcement  learning. 

4.1  Regularized  Off-Policy  TD-Learning 

We  first  present  a  novel  regularized  off-policy  convergent  TD-learning  method  (termed  RO-TD),  which  is 
able  to  learn  sparse  representations  of  value  functions  with  low  computational  complexity.  The  algorithmic 
framework  underlying  RO-TD  integrates  two  key  ideas:  off-policy  convergent  gradient  TD  methods,  such 
as  TDC,  and  a  convex-concave  saddle-point  formulation  of  non-smooth  convex  optimization,  which  enables 
first-order  solvers  and  feature  selection  using  online  convex  regularization.  A  detailed  theoretical  and  exper¬ 
imental  analysis  of  RO-TD  is  presented.  A  variety  of  experiments  are  presented  to  illustrate  the  off-policy 
convergence,  sparse  feature  selection  capability  and  low  computational  cost  of  the  RO-TD  algorithm. 

Temporal-difference  (TD)  learning  is  a  widely  used  method  in  reinforcement  learning  (RL).  Although  TD 
converges  when  samples  are  drawn  “on-policy”  by  sampling  from  the  Markov  chain  underlying  a  policy  in 
a  Markov  decision  process  (MDP),  it  can  be  shown  to  be  divergent  when  samples  are  drawn  “off-policy”. 
Sutton  et  al.  [82]  introduced  convergent  off-policy  temporal-difference  learning  algorithms,  such  as  TDC, 
whose  computation  time  scales  linearly  with  the  number  of  samples  and  the  number  of  features.  Recently, 
a  linear  off-policy  actor-critic  algorithm  based  on  the  same  framework  was  proposed  in  [85]. 

Regularizing  reinforcement  learning  algorithms  leads  to  more  robust  methods  that  can  scale  up  to  large 
problems  with  many  potentially  irrelevant  features.  LARS-TD  [37]  introduced  a  popular  approach  of  combin¬ 
ing  li  regularization  using  Least  Angle  Regression  (LARS)  with  the  least-squares  TD  (LSTD)  framework. 
Another  approach  was  introduced  in  [30]  (LCP-TD)  based  on  the  Linear  Complementary  Problem  (LCP) 
formulation,  an  optimization  approach  between  linear  programming  and  quadratic  programming.  LCP-TD 
uses  “warm-starts”,  which  helps  significantly  reduce  the  burden  of  h  regularization.  A  theoretical  analysis  of 
h  regularization  was  given  in  [20],  including  error  bound  analysis  with  finite  samples  in  the  on-policy  setting. 

Another  approach  integrating  Dantzig  Selector  with  LSTD  was  proposed  in  [52],  overcoming  some  of 
the  drawbacks  of  LARS-TD.  An  approximate  linear  programming  for  finding  regularized  solutions  of  the 
Bellman  equation  was  presented  in  [65].  All  of  these  approaches  are  second-order  methods,  requiring 
complexity  approximately  cubic  in  the  number  of  (active)  features.  Another  approach  to  feature  selection 
is  to  greedily  add  new  features,  proposed  recently  in  [16].  Regularized  first-order  reinforcement  learning 
approaches  have  recently  been  proposed  primarily  using  the  on-policy  scenario.  Here,  the  off-policy  TD 
learning  problem  is  formulated  from  the  stochastic  optimization  perspective.  A  novel  objective  function  is 
proposed  based  on  the  linear  equation  formulation  of  the  TDC  algorithm.  The  optimization  problem  un¬ 
derlying  off-policy  TD  methods,  such  as  TDC,  is  reformulated  as  a  convex-concave  saddle-point  stochastic 
approximation  problem,  which  is  both  convex  and  incrementally  solvable.  A  detailed  theoretical  and  experi¬ 
mental  study  of  the  RO-TD  algorithm  is  presented. 

Here  is  a  brief  roadmap  to  the  rest  of  the  section.  Section  4.2  reviews  the  basics  of  MDPs,  RL  and  recent 
work  on  off-policy  convergent  TD  methods,  such  as  the  TDC  algorithm.  Section  4.3  introduces  the  proximal 
gradient  method  and  the  convex-concave  saddle  point  formulation  of  non  smooth  convex  optimization.  Sec¬ 
tion  4.5  presents  the  new  RO-TD  algorithm.  Convergence  analysis  of  RO-TD  is  presented  in  Section  4.10. 
Finally,  in  Section  4.1 1 ,  experimental  results  are  presented  to  demonstrate  the  effectiveness  of  RO-TD. 

4.2  Off-Policy  Convergent  Methods  for  Reinfiorcement  Learning 

A  Markov  Decision  Process  (MDP)  is  defined  by  the  tuple  (S,A,P^s,,R, 7),  comprised  of  a  set  of  states  S, 
a  set  of  (possibly  state-dependent)  actions  A  (As),  a  dynamical  system  model  comprised  of  the  transition 
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kernel  P°s,  specifying  the  probability  of  transition  to  state  s'  from  state  s  under  action  a,  a  reward  model  R, 
and  0  <  7  <  1  is  a  discount  factor.  A  policy  n  :  S  ->  A  is  a  deterministic  mapping  from  states  to  actions. 
Associated  with  each  policy  n  is  a  value  function  V" ,  which  is  the  fixed  point  of  the  Bellman  equation: 

Vn(s)  =  TwVn(s)  =  iT(s)  +  jP^V^is) 

where  RA  is  the  expected  immediate  reward  function  (treated  here  as  a  column  vector)  and  P*  is  the  state 
transition  function  under  fixed  policy  i r,  and  T*  is  known  as  the  Bellman  operator.  In  what  follows,  we  often 
drop  the  dependence  of  V7T,T7T,R7T  on  n,  for  notational  simplicity.  In  linear  value  function  approximation, 
a  value  function  is  assumed  to  lie  in  the  linear  span  of  a  basis  function  matrix  $  of  dimension  |S|  x  d, 
where  d  is  the  number  of  linear  independent  features.  Hence,  V  «  V  =  $0.  The  vector  space  of  all 
value  functions  is  a  normed  inner  product  space,  where  the  “length”  of  any  value  function  /  is  measured  as 
ll/ll!  =  YjS  £(s)/2(s)  =  /'£/  weighted  by  5,  where  H  is  defined  in  Figure  1 .  For  the  Hh  sample,  <t>t,<j>'t,  0t 
and  St  are  defined  in  Figure  1.  TD  learning  uses  the  following  update  rule  9t+i  =  0t  +  at5t<j>t,  where  at  is  the 
stepsize.  However,  TD  is  only  guaranteed  to  converge  in  the  on-policy  setting,  although  in  many  off-policy 
situations,  it  still  has  satisfactory  performance  [38].  TD  with  gradient  correction  (TDC)  [82]  aims  to  minimize 
the  mean-square  projected  Bellman  error  (MSPBE)  in  order  to  guarantee  off-policy  convergence.  MSPBE 
is  defined  as 


MSPBE(0)  =  ||$0  -  UT(<t>9)\\l  =  ($tS(T$0  -  $0))r($T5$)"1$T5(T$0  -  $0)  (8) 

To  avoid  computing  the  inverse  matrix  ($TS$)”1  and  to  avoid  the  double  sampling  problem  [80]  in  (8), 
an  auxiliary  variable  w  is  defined 

w  =  ($tE$)_1$t5(T$0  —  $0)  (9) 

The  two  time-scale  gradient  descent  learning  method  TDC  [82]  is  defined  below 

9t+ i  =  9t  +  at6t(f>t  -  at'y</>t(<l>fwt),wt+i  =  wt  +  /3t(6t  -  4>fwt)<f>t  (10) 

where  -at'i<t>t  {<t>f  wt)  is  the  term  for  correction  of  gradient  descent  direction,  and  /3t  =  rjat,r]  >  1. 


•  5  is  a  diagonal  matrix  whose  entries  £(s)  are  given  by  a  positive  probability  distri¬ 
bution  over  states,  n  =  $(<t>TS<T)_1$T5  is  the  weighted  least-squares  projection 
operator. 

•  A  square  root  of  A  is  a  matrix  B  satisfying  B2  =  A  and  B  is  denoted  as  Ai.  Note 
that  Ai  may  not  be  unique. 

•  [-,  ■]  is  a  row  vector,  and  [•;  •]  is  a  column  vector. 

•  For  the  Hh  sample,  </>t  (the  Hh  row  of  <t>),  <j)’t  (the  Hh  row  of  $')  are  the  feature 
vectors  corresponding  to  st,s't,  respectively.  9t  is  the  coefficient  vector  for  Hh 
sample  in  first-order  TD  learning  methods,  and  St  =  (rt  +7 4>'tT9t)  -  4>j9t  is  the 
temporal  difference  error.  Also,  xt  =  [tyt;0t],  at  is  a  stepsize,  =  mt,il  >  0. 

•  m,  n  are  conjugate  numbers  if  ^  ^  =  l,m  >  0,n  >  0.  ||a;||m  =  \xj\m)™  's 

the  TO-norm  of  vector  x. 

•  p  is  Zi  regularization  parameter,  A  is  the  eligibility  trace  factor,  N  is  the  sample 
size,  d  is  the  number  of  basis  functions,  p  is  the  number  of  active  basis  functions. 


Figure  5:  Notation  used  in  this  paper. 
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4.3  Proximal  Gradient  and  Saddle-Point  First-Order  Algorithms 

We  now  introduce  some  background  material  from  convex  optimization.  The  proximal  mapping  associated 
with  a  convex  function  h  is  defined  as:3 

1  2 

proxh(x)  =  argmin(/i(u)  +  -||«  —  a;||  )  (11) 

u  2 

In  the  case  of  h{x)  =  p||a:|| x (p  >  0),  which  is  particularly  important  for  sparse  feature  selection,  the  proximal 
operator  turns  out  to  be  the  soft-thresholding  operator  Sp{-),  which  is  an  entry-wise  shrinkage  operator: 

proxh{x)i  =  Sp(xi)  =  max(xi  -  p,  0)  -  max(-Xj  -  p,  0)  (1 2) 

where  i  is  the  index,  and  p  is  a  threshold.  With  this  background,  we  now  introduce  the  proximal  gradient 
method.  If  the  optimization  problem  is 

x*  =  arg  min  (f(x)  +  h(x))  (1 3) 

x£X 

wherein  f(x)  is  a  convex  and  differentiable  loss  function  and  the  regularization  term  h(x)  is  convex,  possibly 
non-differentiable  and  computing  proxh  is  not  expensive,  then  computation  of  (13)  can  be  carried  out  using 
the  proximal  gradient  method : 

xt+i  =  proxath  (xt  -  atVf(xt))  (14) 

where  at  >  0  is  a  (decaying)  stepsize,  a  constant  or  it  can  be  determined  by  line  search. 

4.4  Convex-concave  Saddle-Point  First  Order  Algorithms 

The  key  novel  contribution  of  our  paper  is  a  convex-concave  saddle-point  formulation  for  regularized  off- 
policy  TD  learning.  A  convex-concave  saddle-point  problem  is  formulated  as  follows.  Let  x  e  X, y  e  Y, 
where  X,  Y  are  both  nonempty  closed  convex  sets,  and  f(x)  :  X  ->•  K.  be  a  convex  function.  If  there  exists  a 
function  such  that  /( x)  can  be  represented  as  f(x)  :=  sup yeYp(x,y),  then  the  pair  (ip,Y)  is  referred 
as  the  saddle-point  representation  of  /.  The  optimization  problem  of  minimizing  /  over  X  is  converted 
into  an  equivalent  convex-concave  saddle-point  problem  SadVal  =  mixeXsvLpyeYp{x,y)  of  ^  on  X  x  Y. 
If  /  is  non-smooth  yet  convex  and  well  structured,  which  is  not  suitable  for  many  existing  optimization 
approaches  requiring  smoothness,  its  saddle-point  representation  p  is  often  smooth  and  convex.  The 
convex-concave  saddle-point  problems  are,  therefore,  usually  better  suited  for  first-order  methods  [77],  A 
comprehensive  overview  on  extending  convex  minimization  to  convex-concave  saddle-point  problems  with 
unified  variational  inequalities  is  presented  in  [6]. 

As  an  example,  consider  f{x)  =  ||  Ax  -  b||m  which  admits  a  bilinear  minimax  representation 

f(x)  :=  \\Ax-b\\m  =  max  yT{Ax-b)  (15) 

lli/ll«<i 

where  m,n  are  conjugate  numbers.  Using  the  approach  in  [57],  Equation  (15)  can  be  solved  as 

xt+ 1  =xt-  atATyt,  yt+1  =  Un(yt  +  at(Axt  -  b))  (1 6) 

where  Iin  is  the  projection  operator  of  y  onto  the  unit-/„  ball  ||y||n  <  1, which  is  defined  as 

n  ny  =  rnin(l,  1/I|y||n)y,n  =  2,(n  oay)i  =  min(l,l/|  yi\)yi,n  =  oo  (17) 

and  Uoo  is  an  entrywise  operator. 

4.5  Regularized  off-policy  convergent  TD-Learning 

We  now  describe  a  novel  algorithm,  regularized  off-policy  convergent  TD-learning  (RO-TD),  which  combines 
off-policy  convergence  and  scalability  to  large  feature  spaces.  The  objective  function  is  proposed  based  on 
the  linear  equation  formulation  of  the  TDC  algorithm.  Then  the  objective  function  is  represented  via  its  dual 
minimax  problem.  The  RO-TD  algorithm  is  proposed  based  on  the  primal-dual  subgradient  saddle-point 
algorithm,  and  inspired  by  related  methods  in  [56], [57], 

3The  proximal  mapping  can  be  shown  to  be  the  resolvent  of  the  subdifferential  of  the  function  h. 
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4.6  Objective  Function  of  Off-policy  TD  Learning 

In  this  subsection,  we  describe  the  objective  function  of  the  regularized  off-policy  RL  problem.  We  now 
first  formulate  the  two  updates  of  9t,wt  into  a  single  iteration  by  rearranging  the  two  equations  in  (10)  as 

xt+i  =  xt  -  at(Atxt  -bt),  where  xt  =  [wt;  9t], 


A  =  t)T  b  =  \  Wtfa 

7 t<i>tT  <f>t{<t>t  -7$' t)T  ’  *  L  _ 


(18) 


Following  [82],  the  TDC  algorithm  solution  follows  from  the  linear  equation  Ax  =  6,  where 

A  =  E  [At((f>t,(t>'t)\,b  =  E[  bt{(j>t,rt)\,x  =  [w;6\  (19) 

There  are  some  issues  regarding  the  objective  function,  which  arise  from  the  online  convex  optimization 
and  reinforcement  learning  perspectives,  respectively.  The  first  concern  is  that  the  objective  function  should 
be  convex  and  stochastically  solvable.  Note  that  A,At  are  neither  PSD  nor  symmetric,  and  it  is  not  straight¬ 
forward  to  formulate  a  convex  objective  function  based  on  them.  The  second  concern  is  that  since  we 
do  not  have  knowledge  of  A,  the  objective  function  should  be  separable  so  that  it  is  stochastically  solv¬ 
able  based  on  At,bt.  The  other  concern  regards  the  sample  complexity  in  temporal  difference  learning: 
double-sampling.  As  pointed  out  in  [80],  double-sampling  is  a  necessary  condition  to  obtain  an  unbiased 
estimator  if  the  objective  function  is  the  Bellman  residual  or  its  derivatives  (such  as  projected  Bellman  resid¬ 
ual),  wherein  the  product  of  Bellman  error  or  projected  Bellman  error  metrics  are  involved.  To  overcome 
this  sample  complexity  constraint,  the  product  of  TD  errors  should  be  avoided  in  the  computation  of  gradi¬ 
ents.  To  these  ends,  based  on  the  linear  equation  formulation  in  (19)  and  the  requirement  on  the  objective 
function  in  above,  we  propose  the  regularized  loss  function  as 

L(x)  =  mm\\Ax-b\\m  +  h(x)  (20) 

A 

Here  we  also  enumerate  some  intuitive  objective  functions  and  give  a  brief  analysis  on  the  reasons  why 
they  are  not  suitable  for  regularized  off-policy  first-order  TD  learning.  One  intuitive  idea  is  to  add  a  sparsity 
penalty  on  MSPBE,  i.e.,  L(6)  =  MSPBE(0)+p||0||1.  Because  of  the  penalty  term,  the  solution  to  VL  =  0 
does  not  have  an  analytical  form  and  is  thus  difficult  to  compute.  The  second  intuition  is  to  use  the  least 
square  formulation  of  the  linear  equation  Ax  =  b.  However,  since  A  is  not  symmetric  and  positive  semi- 
definite  (PSD),  A 5  does  not  exist  and  thus  Ax  =  b  cannot  be  reformulated  as  minxeX\\A^x  -  A~H\\l. 
Another  possible  idea  is  to  attempt  to  find  an  objective  function  whose  gradient  is  exactly  Atxt  -  bt  and  thus 
the  regularized  gradient  is  proxaih(Xl)(Atxt  -  bt).  However,  since  At  is  not  symmetric,  this  gradient  does 
not  explicitly  correspond  to  any  kind  of  optimization  problem,  not  to  mention  a  convex  one4. 


4.7  RO-TD  Algorithm  Design 

In  this  section,  the  problem  of  (20)  is  formulated  as  a  convex-concave  saddle-point  problem,  and  the  RO-TD 
algorithm  is  proposed.  Analogous  to  (15),  the  regularized  loss  function  can  be  formulated  as 

L(x)  =  min 1 1  Ax  -  6||m  +  h(x)  =  min  max  ((y,  Ax  -  b})  +  h(x)  =  L(x,  y)  (21 ) 

xex  xex\\y\\n<i 

Similar  to  (16),  Equation  (21)  can  be  solved  via  an  iteration  procedure  as  follows,  where  xt  =  [wt;9t ]. 

xt+i  =xt-  atAjyt  ,  yt+ 1  =  yt  +  at(Atxt  -  bt) 

xt+i  =  proxath{xt+i)  ,  yt+i=Iln(yt+i)  (22) 

The  averaging  step,  which  plays  a  crucial  role  in  stochastic  optimization  convergence,  generates  the  ap¬ 
proximate  saddle  points 


=  (£’ 


•>  Vt 


-E 


4Note  that  the  A  matrix  in  GTD2’s  linear  equation  representation  is  symmetric,  yet  is  not  PSD,  so  it  cannot  be  formulated  as  a 
convex  problem. 
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Due  to  the  computation  of  At  in  (22)  at  each  iteration,  the  computation  cost  appears  to  be  0(Nd2),  where  N 
is  sample  size  defined  in  Figure  5.  However,  the  computation  cost  is  actually  O(Nd)  with  a  linear  algebraic 
trick  by  computing  not  At  but  yj  AtlAtxt  -  bt.  Denoting  yt  =  [yi,t;y2,t],  where  yi,t',yi,t  are  column  vectors 
of  equal  length,  we  have 


yt  =  y  ytf  (yZM  + 1$ (yl'tftt)  (4>t  -  7 (t>'t)T ivvZt  +  \  (24) 

Atxt  -  bt  can  be  computed  according  to  Equation  (10)  as  follows: 

AtXt  -bt=[  -y(St  -  <j>fwt)4>t-,'y(4>Jwt)4>t  -  bt<pt  ]  (25) 

Both  (24)  and  (25)  are  of  linear  computation  complexity.  Now  we  are  ready  to  present  the  RO-TD  algorithm: 


Algorithm  1  RO-TD 

Let  7 r  be  some  fixed  policy  of  an  MDP  M,  and  let  the  sample  set  S  =  Let  be  some  fixed 

basis. 

i:  repeat 

2:  Compute  4>u  <t>t  and  TD  error  8t  =  (rt  +  'y4>'tTdt)  -  $ 9t 

3:  Compute  ( yt,At )  ,Atxt  -  bt  in  Equation  (24)  and  (25). 

4:  Compute  xt+i,yt+i  as  in  Equation  (22) 

5:  Set  t  i —  t  +  1] 

6:  until  t  =  Ni 

7:  Compute  xN,yN  as  in  Equation  (23)  with  t  =  N 


There  are  some  design  details  of  the  algorithm  to  be  elaborated.  First,  the  regularization  term  h(x) 
can  be  any  kind  of  convex  regularization,  such  as  ridge  regression  or  sparsity  penalty  HMli-  In  case  of 
h{x)  =  p||:r||i,  proxath(-)  =  SatP{-).  In  real  applications  the  sparsification  requirement  on  6  and  auxiliary 
variable  w  may  be  different,  i.e.,  h{x)  =  +  p2\\w\\1,pi  ±  p2,  one  can  simply  replace  the  uniform  soft 

thresholding  SatP  by  two  separate  soft  thresholding  operations  SatPl,SatP2  and  thus  the  third  equation  in 
(22)  is  replaced  by  the  following, 


A+i  =  SatPl{9t+i  ),wt+1  =  SatP2(wt+ 1)  (26) 

Another  concern  is  the  choice  of  conjugate  numbers  (m,  n).  For  ease  of  computing  n„,  we  use  (2, 2 )(l2  fit), 
(+oo,  l)(uniform  fit)  or  (1,  +oo).  m  =  n  =  2  is  used  in  the  experiments  below. 


xt+  b  = 


w. 


4.8  RO-GQ(A)  Design 


GQ(A)[43]  is  a  generalization  of  the  TDC  algorithm  with  eligibility  traces  and  off-policy  learning  of  temporally 
abstract  predictions,  where  the  gradient  update  changes  from  Equation  (10)  to 

9t+ i  =  9t  +  at[8tet  -  7(1  -  X)wtTet<j)t+i],  wt+i  =wt+  /3t{8tet  -  w? (27) 


The  central  element  is  to  extend  the  MSPBE  function  to  the  case  where  it  incorporates  eligibility  traces.  The 
objective  function  and  corresponding  linear  equation  component  At,  bt  can  be  written  as  follows: 

L{6)  =  ||$6>  — IIT7rAT><9||!  (28) 


Ati&t > 


7(1  -  X)4>t+ief 


—  'J' 

r)et(<i>t  -  7</>t+i) 

—  jp 

-  7</>i+i) 


,bt{<f>t,rt) 


Similar  to  Equation  (24)  and  (25),  the  computation  of  (yt,At) ,  Atxt  -  bt  is 


yrtet 

rtet 


(29) 


(yt,At)  =  [  V<t>f  (yi,t<t>t)  +7(1  -  X)ef(y%t4>t+i)  (0t -70t+i)T(7?j/^t +j/^)et  ] 

Atxt  —  bt  =  [  -y{5tet  -  (j>Jwt<l)t)nO-  ~  X){eJ wt)(j)t+i  ~  Stet  ]  (30) 

where  eligibility  traces  et,  and  , x,7rA  are  defined  in  [43].  Algorithm  2,  RO-GQ(A),  extends  the  RO-TD 
algorithm  to  include  eligibility  traces. 
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Algorithm  2  RO-GQ(A) 

Let  7T  and  $  be  as  defined  in  Algorithm  1 .  Starting  from  so¬ 
li  repeat 

2:  Compute  (j)t,  4>t+ 1  and  TD  error  6t  =  (rt  +  j<j>J+10t)  -  4>J Bt 

3:  Compute  (yt,  At) ,  Atxt  -  bt  in  Equation  (30). 

4:  Compute  xt+i,yt+i  as  in  Equation  (22) 

5:  Choose  action  at,  and  get  st+i 

6:  Set  t  i —  t  -f-  lj 

7:  until  st  is  an  absorbing  state; 

8:  Compute  xt,yt  as  in  Equation  (23) 


4.9  Extension 

It  is  also  noteworthy  that  to  reach  sparsity  and  regularization,  there  exists  another  formulation  of  loss  func¬ 
tion  different  from  Equation  (21  )  with  the  following  convex-concave  formulation  as  illustrated  in  [77]: 


L{x)  =  mm-\\Ax-b\\22+ pWx^  =  min  max  {bT y  -  ^yTy)  =  L(x,  y) 
x  2  X  ||A*’»||0O<i  2 

This  problem  is  able  to  be  solved  via  FOM  by  adding  an  auxiliary  variable  v. 


L(x,y,v)  =  min  max  (xTy  +  vT(Ax  —  b)  —  ^-vTv) 
x  II "v II _ _ <c i ,-lj  V  2  / 


(31) 


(32) 


which  can  be  solved  iteratively  without  proximal  gradient  step  as  follows,  which  is  a  counterpart  of  Equation 

(22), 


xt+  i  =xt-  atp{yt  +  AtT vt) 

at. 

yt+ b  =yt-\ — xt 

p 


Vt+ 1  =  Vt  +  —  ( Atxt  -bt-  pvt ) 
P 

yt+ 1  =  noo(yt+i) 


(33) 


Similar  as  the  trick  used  in  Equation  (24),  vj At  is  computed  with  linear  computation  cost  by  denoting 
vt  =  [ni ,uv2,t],  where  v\ ,t;tr2,t  are  column  vectors  of  equal  length,  we  have 


vjAt  = 


v<t>T (vTt<i>t)  +  i<t>T {viAt)  (<t>t  -  iftt)  (vvTt  +  vZAfa 


(34) 


4.10  Convergence  Analysis  of  RO-TD 

Assumption  1  (MDP)[82]:  The  underlying  Markov  Reward  Process  (MRP)  M  =  (S,P,R,  7)  is  finite  and 
mixing,  with  stationary  distribution  n.  Assume  that  3  a  scalar  i?max  such  that  Var[rt\st]  <  i?max  holds  w.p.1. 
Assumption  2  (Basis  Function):  $  is  a  full  column  rank  matrix,  namely,  $  comprises  a  linear  independent 
set  of  basis  functions  w.r.t  all  sample  states  in  sample  set  S.  Also,  assume  the  features  (</>t,<^)  have  uni¬ 
formly  bounded  second  moments.  Finally,  if  (st,at,s't)  is  an  i.i.d  sequence,  Moo  <  +00’  IKII00  <  +00- 
Assumption  3  (Subgradient  Boundedness)[56]:  Assume  for  the  bilinear  convex-concave  loss  function 
defined  in  (21),  and  the  sets  X,Y  are  closed  compact  sets.  Then  the  subgradient  (■ yt,At )  and  Atxt  - 
bt  in  RO-TD  algorithm  are  uniformly  bounded,  i.e.,  there  exists  a  constant  L  such  that  \\Atxt-bt\\  < 
L,  ||(yt,4t)||  <  L. 

Proposition  1:  The  approximate  saddle-point  xt  of  RO-TD  converges  w.p.1  to  the  global  minimizer  of  the 
following, 

2;*  =  &Tgmm\\Ax  -  b\\m  +  pWx^  (35) 

x£X 

Proof  Sketch:  We  give  an  outline  of  the  proof  here.  We  first  present  the  monotone  operator  corresponding 
to  the  bilinear  saddle-point  problem  and  then  extend  it  to  the  stochastic  approximation  case  with  certain 
restrictive  assumptions,  and  use  the  result  in  [77], 
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Comparison  of  MSPBE 


||Ax-b||2  and  <y,  Ax-b> 


MSPBE  Minimization 


Figure  6:  Illustrative  examples  of  the  convergence  of  RO-TD  using  the  Star  and  Random-walk  MDPs. 


The  monotone  operator  ${x,y)  with  saddle-point  problem  SadVal  =  infxexsup!(ey^(a;,  y)  is  a  point-to- 
set  operator 

$(x,y)  =  {dx(j){x,y)}  x  {-dv<i>(x,y)} 

where  dx4>(x,  y)  is  the  subgradient  of  <f>(x,  •)  over  x  and  dy<t>{ x,  y)  is  the  subgradient  of  (/>(-,  y)  over  y.  For  the 
bilinear  problem  in  Equation  (15),  the  corresponding  <t >{x,y)  is 

®{x,y)  =  {ATy,b-  Ax)  (36) 

Now  we  verify  that  the  problem  (20)  can  be  reduced  to  a  standard  bilinear  minimax  problem.  To  prove  this 
we  only  need  to  prove  that  the  set  X  in  our  RL  problem  is  indeed  a  closed  compact  convex  set.  This  is 
easy  to  verify  as  we  can  simply  define  X  =  {x|||a;||2  <  R}  where  R  is  large  enough.  In  fact,  the  sparse 
regularization  h(x)  =  p||x||i  helps  xt  stay  within  this  l2  ball.  Now  we  extend  this  argument  to  the  stochastic 
approximation  case.  Here,  the  objective  function  f(x)  is  given  by  the  stochastic  oracle,  and  in  our  case,  it  is 
Ax  -  b,  where  A ,  b  are  defined  in  Equation  (1 9).  Given  Assumption  3  and  further  assuming  that  the  noise  et 
for  t-X h  sample  is  i.i.d  noise,  then  using  the  result  in  [32],  we  can  prove  that  the  RO-TD  algorithm  converges 
to  the  global  minimizer  of 

x*  =  argmin||Aa;-6||m  +  p[|a;[|1 

x£X 

Then  we  prove  the  error  level  of  approximate  saddle-point  xt,yt  defined  in  (23)  is  atL2.  With  the  subgradient 
boundedness  assumption  and  using  the  result  in  Proposition  1  in  [56], 

~2oT  (llyo  ~y*W2  +  Iko-^ll2)  -  aL2  <  L(xt,yt )  -  L(x*,y*)  <  ^  (\\y0  -  yt\\2  +  ||a;0  -x*||2)  +  aL2 

where  L  is  the  subgradient  bound  in  Assumption  3,  and  L(x,  y)  is  the  loss  function  defined  in  Equation  (21 ), 
which  shows  that  the  loss  function  value  of  averaged  iterates  L(xt,yt )  converges  to  the  saddle-point  value 
L(x*,y*)  within  error  level  aL2  with  rate  1/t.  It  is  noteworthy  to  mention  that  the  error  level  can  be  further 
reduced  by  using  a  diminishing  stepsize. 

4.11  Empirical  Results 

We  now  demonstrate  the  effectiveness  of  the  RO-TD  algorithm  against  other  algorithms  across  a  number 
of  benchmark  domains.  LARS-TD  [37],  which  is  a  popular  second-order  sparse  learning  algorithm,  is  used 
as  the  baseline  algorithm  for  feature  selection  and  TDC  is  used  as  the  off-policy  convergent  RL  baseline 
algorithm,  respectively. 

4.12  MSPBE  Minimization  and  Off-Policy  Convergence 

This  experiment  aims  to  show  the  minimization  of  MSPBE  and  off-policy  convergence  of  the  RO-TD  algo¬ 
rithm.  The  7  state  star  MDP  is  a  well  known  counterexample  where  TD  diverges  monotonically  and  TDC 
converges.  It  consists  of  7  states  and  the  reward  w.r.t  any  transition  is  zero.  Because  of  this,  the  Star  MDP 
is  unsuitable  for  LSTD-based  algorithms,  including  LARS-TD.  The  random-walk  problem  [80]  is  a  standard 
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Markov  chain  with  5  states  and  two  absorbing  state  at  two  ends.  Three  sets  of  different  bases  <f>  are  used 
in  [82],  which  are  tabular  features,  inverted  features  and  dependent  features  respectively.  An  identical  ex¬ 
periment  setting  to  [82]  is  used  for  these  two  domains.  The  regularization  term  h(x)  is  set  to  0  to  make  a 
fair  comparison  with  TD  and  TDC.  a  =  0.01,  rj  =  10  for  TD,  TDC  and  RO-TD.  The  comparison  with  TD,  TDC 
and  RO-TD  is  shown  in  the  left  subfigure  of  Figure  6,  where  TDC  and  RO-TD  have  almost  identical  MSPBE 
over  iterations.  The  middle  subfigure  shows  the  value  of  ( yt,Axt  -  b)  and  \\Axt  -  b\\2,  wherein  \\Axt  -  b||2 
is  always  greater  than  the  value  of  ( yt,Axt  -  b).  Note  that  for  this  problem,  the  Slater  condition  is  satisfied 
so  there  is  no  duality  gap  between  the  two  curves.  As  the  result  shows,  TDC  and  RO-TD  perform  equally 
well,  which  illustrates  the  off-policy  convergence  of  the  RO-TD  algorithm.  The  result  of  random-walk  chain 
is  averaged  over  50  runs.  The  rightmost  subfigure  of  Figure  6  shows  that  RO-TD  is  able  to  reduce  MSPBE 
over  successive  iterations  w.r.t  three  different  basis  functions. 

4.13  Feature  Selection 

In  this  section,  we  use  the  mountain  car  example  with  a  variety  of  bases  to  show  the  feature  selection 
capability  of  RO-TD.  The  Mountain  car  MDP  [80]  is  an  optimal  control  problem  with  a  continuous  two- 
dimensional  state  space.  The  steep  discontinuity  in  the  value  function  makes  learning  difficult  for  bases 
with  global  support.  To  make  a  fair  comparison,  we  use  the  same  basis  function  setting  as  in  [37],  where 
two  dimensional  grids  of  2,4,8,16,32  RBFs  are  used  so  that  there  are  totally  1365  basis  functions.  For 
LARS-TD,  500  samples  are  used.  For  RO-TD  and  TDC,  3000  samples  are  used  by  executing  15  episodes 
with  200  steps  for  each  episode,  stepsize  at  =  0.001,  and  p1  =  0.01,  p2  =  0.2.  We  use  the  result  of 
LARS-TD  and  l2  LSTD  reported  in  [37].  As  the  result  shows  in  Table  1 ,  RO-TD  is  able  to  perform  feature 
selection  successfully,  whereas  TDC  and  TD  failed.  It  is  worth  noting  that  comparing  the  performance  of 
RO-TD  and  LARS-TD  is  not  the  focus  of  this  paper  since  LARS-TD  is  not  convergent  off-policy  and  RO-TD’s 
performance  can  be  further  optimized  using  the  mirror-descent  approach  with  the  Mirror-Prox  algorithm  [77] 
which  incorporates  mirror  descent  and  extragradient  [40],  as  discussed  below. 


Algorithm 

LARS-TD 

RO-TD 

h  LSTD 

TDC 

TD 

Success(20/20) 

100% 

100% 

0% 

0% 

0% 

Steps 

142.25  ±  9.74 

147.40  ±  13.31 

- 

- 

- 

Table  1 :  Comparison  of  TD,  LARS-TD,  RO-TD,  l2  LSTD,  TDC  and  TD 


4.14  High-dimensional  Under-actuated  Systems 

The  triple-link  inverted  pendulum  [75]  is  a  highly  nonlinear  under-actuated  system  with  8-dimensional  state 
space  and  discrete  action  space.  The  state  space  consists  of  the  angles  and  angular  velocity  of  each 
arm  as  well  as  the  position  and  velocity  of  the  car.  The  discrete  action  space  is  {0, 5Newton,  -5Newton}. 
The  goal  is  to  learn  a  policy  that  can  balance  the  arms  for  Nx  steps  within  minimum  number  of  learning 
episodes.  The  allowed  maximum  number  of  episodes  is  300.  The  pendulum  initiates  from  zero  equilibrium 
state  and  the  first  action  is  randomly  chosen  to  push  the  pendulum  away  from  initial  state.  We  test  the 
performance  of  RO-GQ(A),  GQ(A)  and  LARS-TD.  Two  experiments  are  conducted  with  Nx  =  10, 000  and 
100,000,  respectively.  Fourier  basis  [39]  with  order  2  is  used,  i.e.,  6561  basis.  Table  2  shows  the  results  of 
this  experiment,  where  RO-GQ(A)  performs  better  than  other  approaches,  especially  in  Experiment  2,  which 
is  a  harder  task.  LARS-TD  failed  in  this  domain,  which  is  mainly  not  due  to  LARS-TD  itself  but  the  quality 


Experiment\Method 

RO-GQ(A) 

GQ(A) 

LARS-TD 

Experiment  1 

6.9  ±4.82 

11.3  ±9.58 

- 

Experiment  2 

14.7  ±  10.70 

27.2  ±6.52 

- 

Table  2:  Comparison  of  RO-GQ(A),  GQ(A),  and  LARS-TD  on  Triple-Link  Inverted  Pendulum  Task  showing 
minimum  number  of  learning  episodes. 
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of  samples  collected  via  random  walk.  We  also  find  that  tuning  the  sparsity  parameter  p2  generates  an 
interpolation  between  GQ(A)  and  TD  learning,  where  a  large  p2  helps  eliminates  the  correction  term  of  TDC 
update  and  make  the  update  direction  more  similar  to  the  TD  update.  To  sum  up,  RO-TD  and  RO-GQ(A) 
tends  to  outperform  GQ(A)  in  all  aspects,  and  is  able  to  outperform  LARS-TD  in  high  dimensional  domains, 
as  well  as  in  selected  smaller  MDPs  where  LARS-TD  diverges  (e.g.,  the  Star  MDP).  It  is  worth  noting  that 
the  computation  cost  of  LARS-TD  is  OjNdp3),  where  that  for  RO-TD  is  O(Nd).  If  p  is  linear  or  sublinear  w.r.t 
d,  RO-TD  has  a  significant  advantage  over  LARS-TD.  However,  compared  with  LARS-TD,  RO-TD  requires 
fine  tuning  the  parameters  of  at,pi,p2  and  is  usually  not  as  sample  efficient  as  LARS-TD. 

4.15  Conclusion 

This  section  presented  a  novel  unified  framework  for  designing  regularized  off-policy  convergent  RL  al¬ 
gorithms  combining  a  convex-concave  saddle-point  problem  formulation  for  RL  with  stochastic  first-order 
methods.  A  detailed  experimental  analysis  reveals  that  the  proposed  RO-TD  algorithm  is  both  off-policy 
convergent  and  is  robust  to  noisy  features.  There  are  many  interesting  future  directions  for  this  research. 
One  direction  for  future  work  is  to  extend  the  subgradient  saddle-point  solver  to  a  more  generalized  mirror 
descent  framework.  Mirror  descent  is  a  generalization  of  subgradient  descent  with  non-Euclidean  distance 
[6],  and  has  many  advantages  over  gradient  descent  in  high-dimensional  spaces  In  [77],  two  algorithms  to 
solve  the  bilinear  saddle-point  formulation  are  proposed  based  on  mirror  descent  and  extragradient  [40], 
such  as  the  Mirror-Prox  algorithm.  [77]  also  points  out  that  the  Mirror-Prox  algorithm  may  be  further  op¬ 
timized  via  randomization.  To  scale  to  larger  MDPs,  it  is  possible  to  design  SMDP-based  mirror-descent 
methods  as  well. 

5  Sparse  Nonlinear  Value  Function  Approximation 

Finally,  we  describe  a  new  approach  to  representation  discovery  in  reinforcement  learning  (RL)  using  basis 
adaptation.  We  introduce  a  general  framework  for  basis  adaptation  as  nonlinear  separable  least-squares 
value  function  approximation  based  on  finding  Frechet  gradients  of  an  error  function  using  variable  projec¬ 
tion  functionals.  We  then  present  a  scalable  proximal  gradient-based  approach  for  basis  adaptation  using 
the  recently  proposed  mirror-descent  framework  for  RL.  Unlike  traditional  temporal-difference  (TD)  meth¬ 
ods  for  RL,  mirror  descent  based  RL  methods  undertake  proximal  gradient  updates  of  weights  in  a  dual 
space,  which  is  linked  together  with  the  primal  space  using  a  Legendre  transform  involving  the  gradient  of 
a  strongly  convex  function.  Mirror  descent  RL  can  be  viewed  as  a  proximal  TD  algorithm  using  Bregman 
divergence  as  the  distance  generating  function.  We  present  a  new  class  of  regularized  proximal-gradient 
based  TD  methods,  which  combine  feature  selection  through  sparse  Li  regularization  and  basis  adaptation. 
Experimental  results  are  provided  to  illustrate  and  validate  the  approach. 

5.1  Introduction 

There  has  been  rapidly  growing  interest  in  reinforcement  learning  in  representation  discovery  [45].  Basis 
construction  algorithms  [46]  combine  the  learning  of  features,  or  basis  functions,  and  control.  Basis  adap¬ 
tation  [7,  53]  enables  tuning  a  given  parametric  basis,  such  as  the  Fourier  basis  [39],  radial  basis  functions 
(RBF)  [53],  polynomial  bases  [41],  etc.  to  the  geometry  of  a  particular  Markov  decision  process  (MDP). 
Basis  selection  methods  combine  sparse  feature  selection  through  L±  regularization  with  traditional  least- 
squares  type  RL  methods  [36,  27],  linear  complementarity  methods  [27],  approximate  linear  programming 
[65],  or  convex-concave  optimization  methods  for  sparse  off-policy  TD-learning  [42], 

In  this  paper,  we  present  a  new  framework  for  basis  adaptation  as  nonlinear  separable  least-squares 
approximation  of  value  functions  using  variable  projection  functionals.  This  framework  is  adapted  from  a 
well-known  classical  method  for  nonlinear  regression  [21  ],  when  the  model  parameters  can  be  decomposed 
into  a  linear  set,  which  is  fit  by  classical  least-squares,  and  a  nonlinear  set,  which  is  fit  by  a  Gauss-Newton 
method  based  on  computing  the  gradient  of  an  error  function  based  on  variable  projection  functionals. 

Mirror  descent  is  a  highly  scalable  online  convex  optimization  framework.  Online  convex  optimization 
[88]  explores  the  use  of  first-order  gradient  methods  for  solving  convex  optimization  problems.  Mirror  de- 
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scent  can  be  viewed  as  a  first-order  proximal  gradient  based  method  [3]  using  a  distance  generating  function 
that  is  a  Bregman  divergence  [1 3].  We  combine  basis  adaptation  with  mirror-descent  RL  [47],  a  recently  de¬ 
veloped  first-order  approach  to  sparse  RL.  The  proposed  approach  is  also  validated  with  some  experiments 
showing  improved  performance  compared  to  previous  work. 

5.2  Nonlinear  Value  Function  Approximation 

A  key  contribution  of  this  paper  is  a  general  framework  for  nonlinear  value  function  approximation  based 
on  nonlinear  separable  least-squares.  This  framework  helps  clarify  previously  proposed  algorithms  [7, 
53],  which  amount  to  particular  instances  of  this  general  framework.  Concretely,  basis  adaptation  can  be 
understood  as  modifying  a  parameterized  basis,  $(«)  where  a  denotes  the  nonlinear  parameters,  and  w 
denotes  the  linear  parameters,  such  that  V  «  4>(a)u>.  For  example,  for  radial  basis  function  (RBF)  bases,  a 
would  correspond  to  the  mean  (center),  pu  and  covariance  (width)  parameters,  E,;,  of  a  fixed  set  of  bases, 
where  these  would  be  tuned  based  on  optimizing  some  particular  error,  such  as  the  Bellman  error  [53]. 
Such  an  approach  can  broadly  be  characterized  using  the  framework  of  separable  nonlinear  least  squares 
framework  for  regression  [21],  which  provides  a  framework  for  unifying  past  work  on  basis  adaptation.  The 
general  nonlinear  regression  problem  is  to  estimate  a  vector  of  parameters  9  from  a  set  of  training  examples 
V  =  {(zi,  j/i ), . . . ,  (xn,  yn)j  to  minimize  (for  example)  the  squared  error: 


=  X>< -/(*<;  *)ii2- 

i— 1 

In  the  nonlinear  separable  least  squares  setting,  the  function  f{xq  6)  =  f{xp  a,  w )  =  [4>(a)w]j,  where  a  are 
the  nonlinear  variables  and  w  denote  the  linear  variables.  The  squared  error  can  now  be  written  as: 

J(oi,w)  =  \\y  -  4>(a)w||2  =  ||  (I  -  4>(a)$f(a))  y||2 

Golub  and  Pereyra  define  the  last  (boldfaced)  term  above  as  the  variable  projection  functional  (VP).  To 
explain  the  derivation  of  the  second  step  from  the  first,  note  that  for  a  specified  value  of  the  nonlinear 
parameters  a,  for  the  best  linear  fit,  the  linear  parameter  values  can  computed  as  w  =  &(a)y  (here,  f 
denotes  the  Moore-Penrose  pseudo-inverse).  Geometrically,  note  that 

ni(a)  =  (/-$(«)4>t(a)) 

is  the  projector  onto  the  complement  of  the  column  space  of  the  parameterized  basis  matrix  $(«).  Using 
this  notation,  the  nonlinear  separable  least  squares  problem  can  be  stated  formally  as: 

Q  opt  =  ars  main  1 1  ni(«)  y  1 1 2  (37) 

If  this  problem  is  solved  to  find  the  optimal  a0pt,  the  optimal  linear  parameters  u>0pt  =  4^  (a0pt);y.  To  argue 
that  this  strategy  is  correct,  Golub  and  Pereyra  prove  the  following  result: 

Theorem  1.  [21]  Assume  that$(a)  has  constant  rank  r  in  every  open  setQ  c  Rk,  where  a  e  Ft.  lf(a*,w*) 
is  the  global  minimizer  of  J (a,  w ),  then  a0pf  =  a*  and  w0pt  =  w*. 

To  solve  the  optimization  problem  given  by  Equation  37,  Golub  and  Pereyra  derive  a  modified  Gauss- 
Newton  algorithm.  The  key  step  in  their  algorithm  involves  computation  of  the  Jacobian  V||n^(Q)y||2,  which 
involves  finding  the  gradient  of  the  pseudo  inverse  as  follows: 

^vllni(a)yll2  =  (a)y) 

where  £><£>(«)  is  the  Frechet  (or  tensor)  derivative  of  the  parameterized  basis  matrix  4>(a).  If  there  are  m 
basis  functions  fa  (columns  of  4>(a)),  n  training  points,  and  k  nonlinear  parameters  a,  then  £>$(«)  is  a 
tensor  of  size  nx  mx  k. 
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5.3  Bellman  Residual  VP  Functional 

The  Nonlinear  Bellman  Residual  minimization  problem  is  to  find  a  set  of  nonlinear  weights  aBR  €  and 
linear  weights  wBr  such  that5 

(■ (xbr ,  wBr)  =  argmin^  jr^VO  -  V\\2p„  (38) 

where  V  =  $(a)w  «  Vir,  and  the  norm  ||x||^  =  (x,Dp*x)  where  Dp *  is  a  diagonal  matrix  whose  entries 
reflect  the  stationary  distribution  of  the  Markov  chain  corresponding  to  policy  n.  To  lighten  the  notation 
below,  we  will  assume  Dp-«  =  /,  although  our  entire  derivation  carries  through  without  change  in  the  more 
general  case. 

Defining  =  ((/  -  7P,r)<I>(a)),  we  can  now  eliminate  the  linear  variables  wNBR  and  write  the  nonlinear 
Bellman  residual  variable  projection  functional  as  follows:  aNBR  = 

=  argmin  ||i?7r  +  /yP7VV  —  V\\2 

a 

=  argmin  \\Rn  +  (7P'r$(a)  -  $(a))(^)tP7r||2 

a 

=  argmin  ||  [/  -  (J  -  7Pw)$(a)(^)tl  Rn ||2 

a 

We  can  rewrite  the  final  result  above  as  aNBR  = 

argmin  \\Il^R*\\2  =  argmin  ||(J  -  T'^(^)t)P’r||2.  (39) 

a  a  a 

Given  aNBR,  the  linear  weights  wNBR  can  be  written  as: 

wNbr  =  mbr)R7T  =  ~  7 P7T)^{aNBR))^  R w 

Comparing  with  Equation  37,  the  optimization  problem  for  nonlinear  Bellman  residual  minimization  once 
again  involves  minimization  of  a  projector  onto  the  complement  of  a  matrix  column  space,  in  this  case  \I>£. 
We  can  state  a  new  result  extending  [21]  to  the  RL  setting: 

Theorem  2.  Assume  that  T'jj  has  constant  rank  r  in  every  open  set  0  c  where  a  e  Cl.  If  (aBR,  wBR)  is 
the  global  minimizer  of  Equation  38,  then  aNBR  =  aBR  and  wNBR  =  wBR. 

5.4  Nonlinear  LSPI 

We  briefly  describe  a  nonlinear  variant  of  least-squares  policy  iteration  (LSPI),  where  on  each  pass  through 
the  training  data  V  =  {st,at,rt,st+ 1},  a  nonlinear  optimization  problem  involving  the  best  selection  of  the 
nonlinear  parameters  a  is  computed  by  solving  (a  sampled  version  of)  Equation  39.  We  used  lsqnonlin,  a 
nonlinear  least-squares  solver  within  MATLAB,  to  solve  a  sampled  variant  of  Equation  39,  where  $(at)  and 
PnA>(at)  were  approximated  by  their  sampled  counterparts  <f>(at)  and  <t>'(at).  Once  aNBR  was  found  by 
the  nonlinear  phase,  we  used  LSPI  with  these  settings  to  find  the  linear  weights  wNBR.  Figure  7  illustrates 
the  application  of  this  approach  to  a  simple  25  state  chain  domain  [41],  The  reward  is  at  the  center,  so  the 
optimal  policy  is  to  go  right  in  the  first  half  of  the  chain  and  go  left  otherwise.  The  Q  function  is  approximated 
by  a  set  of  radial  basis  functions  (RBFs),  whose  initial  centers  and  widths  are  specified  as  shown.  The  figure 
also  shows  the  final  modified  centers  and  widths  of  the  RBFs  using  the  Bellman  residual  variable  projection 
functional  minimization  method. 

5.5  Mirror  Descent  RL 

While  the  above  framework  of  nonlinear  separable  value  function  approximation  using  variable  projection 
functionals  is  theoretically  elegant,  it  can  be  computationally  expensive  to  compute  Frechet  gradients  of  the 
variable  projection  functional  in  Equation  39  in  large  MDPs.  We  describe  a  novel  mirror-descent  gradient- 
based  approach  to  basis  adaptation  that  shows  how  to  combine  nonlinear  value  function  approximation  and 
sparsity. 

5Similar  VP  functionals  can  be  defined  for  other  loss  functions,  e.g.  for  the  fixed  point  or  TD  case. 


19 


Initial  RBF  centers 


Final  RBF  centers 


Final  RBF  Widths 


Figure  7:  This  figure  shows  the  results  of  nonlinear  basis  adaptation  by  solving  the  Bellman  residual  variable 
projection  functional  represented  by  Equation  39  to  find  the  optimal  setting  for  the  RBF  centers  and  widths 
in  a  simple  25  state  chain  domain.  Top  two  figures:  initial  centers  and  widths  of  RBF  bases.  Bottom  two 
figures:  Final  centers  and  widths  after  convergence  of  nonlinear  LSPI  to  the  optimal  policy. 


5.6  Proximal  Mappings  and  Mirror  Descent 

Mirror  descent  can  be  viewed  as  a  first-order  proximal  gradient  method  [3,  88]  for  solving  high-dimensional 
convex  optimization  problems.  The  simplest  online  convex  algorithm  is  based  on  the  classic  gradient  de¬ 
scent  procedure  for  minimizing  a  convex  function  /,  given  as: 


Wo  e  X,  wt  =  n.Y  (wt~  1  -  /3tV/(wt_i))  :  t  >  1  (40) 

where  Ux(x)  =  argminyeX||a;  -  y\\ 2  is  the  projector  onto  set  X,  and  is  a  stepsize.  If  /  is  not  differen¬ 
tiable,  then  the  subgradient  df  can  be  substituted  instead,  resulting  in  the  well-known  projected  subgradient 
method,  a  workhorse  of  nonlinear  programming.  The  proximal  mapping  associated  with  a  convex  function 
h  is  defined  as: 

prox^Or)  =  argminu  (h(u)  +  ||u  -  x\\22) 

If  h{x )  =  0,  then  proxh(x)  =  x,  the  identity  function.  If  h( x)  =  Ic(x),  the  indicator  function  for  a  convex  set 
C,  then  proxJc(a;)  =  nc(ir),  the  projector  onto  set  C.  For  learning  sparse  representations,  the  case  when 
h(w)  =  HMli,  f°r  some  scalar  weighting  factor  v,  is  particularly  important.  In  this  case,  the  entry-wise 
proximal  operator  is: 

{Wi  —  IS,  if  Wi  >  IS 

0,  if  |tWi|  <  v  (41) 

Wi  +  v  otherwise 

An  interesting  observation  follows  from  noting  that  the  projected  subgradient  method  (Equation  40)  can  be 
written  equivalently  using  the  proximal  mapping  as: 


Ik-wtlll)  (42) 

An  intuitive  way  to  understand  this  equation  is  to  view  the  first  term  as  requiring  the  next  iterate  wt+ 1  to 
move  in  the  direction  of  the  (sub)  gradient  of  /  at  wt,  whereas  the  second  term  requires  that  the  next  iterate 
wt+ 1  not  move  too  far  away  from  the  current  iterate  wt.  Note  that  the  (sub)gradient  descent  is  a  special 
case  of  Equation  (42)  with  Euclidean  distance  metric.  With  this  introduction,  we  can  now  introduce  the  main 
concept  of  mirror  descent.  We  follow  the  treatment  in  [3]  in  presenting  the  mirror  descent  algorithm  as  a 
nonlinear  proximal  method  based  on  a  distance  generating  function  that  is  a  Bregman  divergence  [13]. 

Definition  1:  A  distance  generating  function  ip(x)  is  defined  as  a  continuously  differentiable  strongly 
convex  function  (with  modulus  a)  which  satisfies: 

(x1  —  x,  Xip(x')  —  Vi/j(x))  >  a\\x'  —  x\\2  (43) 


wt+i  =  argmin^gy 


[w,df{wt))  +  ^ 
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Given  such  a  function  ip,  the  Bregman  divergence  associated  with  it  is  defined  as: 


D^(x,  y)  =  ip(x)  -  ip(y)  -  (Vip(y),  x  -  y)  (44) 

Intuitively,  the  Bregman  divergence  measures  the  difference  between  the  value  of  a  strongly  convex  function 
ip{x)  and  the  estimate  derived  from  the  first-order  Taylor  series  expansion  at  ip(y).  Many  widely  used 
distance  measures  turn  out  to  be  special  cases  of  Bregman  divergences,  such  as  Euclidean  distance 
(where  ip{x)  =  ^ ||ar|||  )  and  Kullback  Liebler  divergence  (where  ip( x)  =  Y^ixi\og2xi,  the  negative  entropy 
function).  In  general,  Bregman  divergences  are  non-symmetric,  but  projections  onto  a  convex  set  with 
respect  to  a  Bregman  divergence  are  well-defined. 

The  general  mirror  descent  procedure  can  be  written  as: 

Wt+i  =  argminu,6X  ^{w,  df(wt))  +  ^-D^(w,  wt)j  (45) 

Notice  that  the  squared  distance  term  in  Equation  42  has  been  generalized  to  a  Bregman  divergence.  The 
solution  to  this  optimization  problem  can  be  stated  succinctly  as  the  following  generalized  gradient  descent 
algorithm,  which  forms  the  core  procedure  in  mirror  descent: 


wt+i  =  Vi/’*  (Vip(wt)  -  atdf(wt ))  (46) 

Here,  ip*  is  the  Legendre  transform  of  the  strongly  convex  function  ip,  which  is  defined  as 

ip*{y)  =  sup  ((x,y)  -  ip(x)) 

x£X 

It  can  be  shown  that  V0*  =  (V0)-1  [3].  Mirror  descent  is  a  powerful  first-order  optimization  method 
that  been  shown  to  be  “universal”  in  that  if  a  problem  is  online  learnable,  it  leads  to  a  low-regret  solution 
using  mirror  descent  [78].  It  is  shown  in  [5]  that  the  mirror  descent  procedure  specified  in  Equation  46 
with  the  Bregman  divergence  defined  by  the  p-norm  function,  defined  below,  can  outperform  the  projected 
subgradient  method  by  a  factor  |o”n  where  n  is  the  dimensionality  of  the  space.  For  high-dimensional 
spaces,  this  ratio  can  be  quite  large. 

5.7  Sparse  Learning  with  Mirror  Descent  TD 

Recently,  a  new  framework  for  sparse  regularized  RL  was  proposed  based  on  mirror  descent  [47],  Unlike 
regular  TD,  the  weights  are  updated  using  the  TD  error  in  the  dual  space  by  mapping  the  primal  weights 
w  using  a  gradient  of  a  strongly  convex  function  ip.  Subsequently,  the  updated  dual  weights  are  converted 
back  into  the  primal  space  using  the  gradient  of  the  Legendre  transform  of  ip,  namely  Vi/’*.  To  achieve 
sparsity,  the  dual  weights  9  are  truncated  according  to  Equation  41  to  satisfy  the  /  ,  penalty  on  the  weights. 
Here,  v  is  the  sparsity  parameter.  An  analogous  approach  for  lx  penalized  classification  and  regression  was 
suggested  in  [74], 

One  choice  for  Bregman  divergence  is  the  p-norm  function,  where  ip{w)  =  i||u;||g,  anc*  its  conjugate 

1 

Legendre  transform  ip*{w)  =  Here,  ||u>||g  =  \wi\q')  " <  and  P  anc*  Q  are  conjugate  numbers  such 

that  i  |  =  1.  This  ip{w)  leads  to  the  p-norm  link  function  9  =  f(w)  where  /  :  Rd  Rd  : 

f  signKOKr1  f-im  sign(0j)|0j|p-1 

f,(  )  -  iMir  '  ’•  (  )  ~  ii«»r2  m 

Many  other  choices  for  Bregman  divergence  are  possible. 

5.8  Basis  Adaptation  with  Mirror  Descent  RL 

Our  method  is  based  on  the  two-time  scale  stochastic  approximation  framework,  whereby  the  linear  param¬ 
eters  w  are  adapted  at  a  faster  time-scale  than  the  nonlinear  parameters  a.  Algorithm  1  below  describes 


21 


the  nonlinear  adaptive  basis  mirror-descent  variant  of  Watkins  Q(X)  algorithm.6  We  indicate  the  dynamically 
varying  nature  of  the  bases  as  <j>t(st,at)  where  the  subscript  denotes  the  particular  value  of  the  nonlinear 
parameters  at  at  time  t.  In  this  section,  we  denote  /3t  as  the  learning  rate  for  the  faster  time-scale  update 
procedure  for  updating  the  linear  weights  wt  and  as  the  learning  rate  for  the  slower  time-scale  parameter 
for  updating  the  nonlinear  basis  parameters  at.  The  key  idea  underlying  Algorithm  1  is  to  adapt  the  nonlin¬ 
ear  parameters  using  the  gradient  of  the  TD  error.  As  the  gradient  of  the  TD  error  is  not  easily  computed 
due  to  the  max  computation  in  Q( A),  what  is  done  here  is  to  approximate  the  maximum  using  a  smoothed 
differentiable  approximation: 


max  Xi  «  fa({xi}i= =  log(V'e‘”:i) 

i=l,...,n  z ' 

i= 1 

We  update  the  nonlinear  basis  parameters  at  on  a  slower  time  scale  using  the  gradient  of  the  smoothed 
TD  error.  The  mirror-descent  version  of  Watkins  Q( A)  with  basis  adaptation  is  given  below. 


Algorithm  3  Mirror  Descent  Q(A)  with  Basis  Adaptation 
i:  Given:  Parametric  basis  <t>(a) 

2:  Initialize  linear  parameters  w0  and  nonlinear  parameters  a0. 

3:  repeat 

4:  Do  action  at  =  argma xa*((j>t(st,a*),wt)  with  high  probability,  else  do  a  random  action.  Observe  next 

state  st+i  and  reward  rt. 

5:  Compute  the  actual  TD  error 


St  =  n  +  7  max(<^t(st+i,  a),  wt)  -  at),wt) 

a 

6:  Compute  the  smoothed  TD  error 

wt  =  rt  +  /yftT({((j>t(st+i,a'),Wt)}a')  -  {4>t{sti  at),wt) 


7:  Compute  the  gradient  of  the  smoothed  TD  error  w.r.t..  a 


Ka(st,at,st+ 1)  = 

a' 


X  X7a(((j>t(st+i,a'),wt))  -  Va((0t(st+i,  at),  wt)) 

8:  Update  the  eligibility  trace  et  et  +  A7 4>t(st,  at) 

9:  Update  the  dual  weights  9t  =  Vipt(wt)  +  / 3t6tet  (e.g.,  ip(w)  =  i||w||g  is  the  p-norm  link  function). 

10:  Truncate  weights: 

Vj,  6»‘+1  =  sign(0‘+1)  max(0,  \d]+1  \  -  /3tv) 

ii:  Update  linear  weight  parameters:  wt+ 1  =  (0t+i)  (e.g.,  tp*(9)  =  \\\0\\p  andp  and  q  are  dual  norms 

such  that  i  i  =  1). 

12:  Update  nonlinear  basis  parameters: 


ott+i  +  ^tUt.Ka 


13:  Set  t  4 —  t  1. 

14:  until  done. 

Return  Q(s,a)  =  {${at)wt)s,a  as  the  approximate  h  penalized  sparse  optimal  action  value  function. 
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Comparison  of  Four  Methods  over  25  runs 


Figure  8:  Comparing  the  convergence  of  two  mirror-descent  Q( A)  methods  over  25  runs  with  49  (a  and  ct 
parameter  tuned)  nonlinear  Fourier  bases  in  the  mountain  car  task  with  that  of  Q( A),  and  mirror-descent 
Q(\),  with  49  fixed  Fourier  bases. 

5.9  Experimental  Results 

In  this  section,  we  provide  some  illustrative  experiments  evaluating  the  performance  of  the  mirror-descent 
Q(X)  method  with  a  nonlinear  generalization  of  the  fixed  Fourier  basis  [39].  The  nonlinear  Fourier  basis  that 
we  used  is  as  follows: 

4>z{x)  =  cos  (nai(ci,x)) 

The  nonlinear  parameters  here  include  both  a*  and  ct.  Note  that  in  the  previous  work  on  Fourier  bases  [39], 
the  scaling  factor  a,  was  fixed  to  unity,  whereas  in  our  case,  it  is  allowed  to  vary  (as  shown  in  Figure  10). 
Also,  earlier  work  set  c*  =  [ci, . . .  ,cd},Cj  e  [0, . . .  ,n],  1  <  j  <  d.  For  example,  in  the  mountain  car  domain, 
d  =  2,  and  setting  n  =  3  would  result  in  16  possible  features  (since  n  =  0, 1, 2, 3).  The  nonlinear  parameter 
at  can  be  seen  here  as  a  weighting  of  how  important  each  c*  is.  In  our  initial  experiments,  we  vary  both  a 
and  Ci  separately  to  compare  their  effects. 

Figure  8  compares  the  convergence  and  stability  of  Q(A)  and  mirror-descent  Q( A)  using  49  fixed  Fourier 
bases  with  mirror-descent  Q(A)  using  the  same  number  of  adaptive  (a  and  Ct  parameter  tuned)  nonlinear 
Fourier  bases.  In  this  case,  all  methods  converge,  but  the  two  nonlinear  mirror-descent  methods  (bottom 
curves)  converge  much  more  quickly.  The  mirror-descent  methods  have  converged  by  7  episodes  to  an 
average  solution  length  at  least  thrice  as  small  as  the  regular  Q{ A)  method,  which  takes  much  longer  to 
reliably  achieve  the  same  solution  quality.  For  the  experimental  result  shown  in  Figure  8,  where  we  show 
convergence  in  the  mountain  car  domain  with  49  Fourier  bases,  we  set  /3t  =  0.03,  =  5  x  10~7,  and 

v  =  0.001.  The  p-norm  Bregman  divergence  was  used,  where  p  =  21og10(num_features  x  num_actions), 
and  q  =  The  learning  rate  for  Q(A)  had  to  be  set  fairly  low  at  0.001  when  the  number  of  features 
exceeded  16  as  higher  rates  caused  instability. 

Figure  9  shows  the  variance  across  25  runs  of  the  four  methods  compared  in  Figure  8.  Note  the 
improved  variance  of  the  two  mirror-descent  methods  that  use  basis  adaptation,  showing  they  not  only 
result  in  improved  performance,  but  also  lower  variance.  Finally,  Figure  10  shows  the  ranges  of  the  a 
parameters  as  tuned  by  the  basis  adaptation  algorithm  for  49  Fourier  bases  over  25  runs. 

5.10  Conclusions  and  Future  Work 

This  section  introduced  a  broad  framework  for  basis  adaptation  as  nonlinear  separable  least-squares  value 
approximation,  and  described  a  specific  gradient-based  method  for  basis  adaptation  using  mirror  descent 

6The  extension  to  related  methods  like  SARSA( A)  [80]  is  straightforward.  Some  details  abbreviated  for  clarity. 
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Mirror  Descent  Q  Nonlinear  Ci  Mirror  Descent  Q  ,  Nonlinear  Alpha 


Mirror  Descent  Q(lambda):  25  runs  Q(lambda):  25  runs 


Figure  9:  Variance  of  the  four  methods  compared  in  Figure  8  across  25  runs.  Note  the  improved  variance 
of  the  two  methods  shown  on  top  that  combine  mirror-descent  updating  and  basis  adaptation. 


Mirror  Descent  Q(lambda)  with  Nonlinear  Fourier  Basis:  25  runs 


Mirror  Descent  Q(Lambda)  With  Nonlinear  Fourier  Basis  Weights:  25  runs 


Fourier  Basis  Parameter  (features  =  49,  learning  rate:  0.03) 


Figure  10:  (a)  Extended  run  of  the  mirror-descent  Q{ A)  method  with  basis  adaptation  of  the  a  parameters 
in  the  mountain  car  domain,  (b)  Boxplot  of  the  a  parameters  for  49  Fourier  bases  for  the  run  shown  on  top. 


optimization.  Currently,  the  learning  rate  for  nonlinear  parameters  needs  to  be  set  quite  low  for  stable 
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convergence.  More  rapid  tuning  of  the  nonlinear  parameters  could  be  achieved  by  including  an  additional 
mirror  descent  step  for  modifying  them. 


6  Summary 

This  research  investigated  the  automated  design  of  problem-specific  representations  for  approximately 
solving  Markov  decision  processes  (MDPs),  a  widely  used  model  of  sequential  decision  making.  The 
research  carried  out  nvestigated  a  variety  of  approaches  to  automatic  basis  construction,  including  reward- 
sensitive  and  reward-invariant  methods,  diagonalization  and  dilation  methods,  as  well  as  orthogonal  and 
over-complete  representations.  A  unifying  perspective  on  the  various  basis  construction  methods  emerges 
from  showing  they  result  from  different  power  series  expansions  of  value  functions,  including  the  Neumann 
series  expansion,  the  Laurent  series  expansion,  and  the  Schultz  expansion. 

The  research  also  develops  new  computational  algorithms  for  learning  to  solve  Markov  decision  pro¬ 
cesses  with  an  overcomplete  representation.  The  main  idea  is  to  use  a  Lt  regularized  penalty  function 
that  promotes  finding  sparse  representations  of  value  functions.  A  key  idea  investigated  in  this  research  is 
the  use  of  mirror-descent  optimization  methods  to  find  sparse  representations  of  value  functions.  A  new 
regularized  off-policy  bradient  temporal-difference  (TD)  method  is  proposed,  which  combines  off-policy  con¬ 
vergence,  robustness  to  irrelevant  features,  and  scalability  to  large  problems.  Experimental  results  show 
promising  performance. 
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